



(12) INTERNATIONAL APPLICATION PUBLISHED UNDER THE PATENT COOPERATION TREATY (PCT) 



(19) World Intellectual Property Organization 

International Bureau 

(43) International Publication Date 
25 May 2001 (25.05.2001) 









(10) International Publication Number 

PCT WO 01/37147 A2 



(51) International Patent Classification 7 : G06F 17/50 

(21) International Application Number: PCT/EP00/ 10923 

(22) International Filing Date: 

3 November 2000 (03.11.2000)



(25) Filing Language: English

(26) Publication Language: English



(30) Priority Data:
60/163,409    3 November 1999 (03.11.1999)    US



(71) Applicant (for all designated States except US): ALGONOMICS NV
[BE/BE]; Lode de Boningelaan 12, B-8500 Kortrijk (BE).

(72) Inventors; and

(75) Inventors/Applicants (for US only): DESMET, Johan [BE/BE]; Lode de
Boningelaan 12, B-8500 Kortrijk (BE). LASTERS, Ignace [BE/BE]; Stierstraat 23,
B-2018 Antwerpen (BE). VLIEGHE, Dominique [BE/BE]; Moorseelsestraat 288,
B-8501 Kortrijk-Heule (BE). BOUTTON, Carlo [BE/BE]; Lindelaan 60, B-8550
Zwevegem (BE). STAS, Philippe [BE/BE]; Cornelis Peetersstraat 1, B-1830
Machelen (BE).

(74) Agents: BIRD, William, E. et al.; Bird Goen & Co, Vilvoordsebaan 92,
B-3020 Winksele (BE).

(81) Designated States (national): AE, AG, AL, AM, AT, AU,
AZ, BA, BB, BG, BR, BY, BZ, CA, CH, CN, CR, CU, CZ,
DE, DK, DM, DZ, EE, ES, FI, GB, GD, GE, GH, GM, HR,
HU, ID, IL, IN, IS, JP, KE, KG, KP, KR, KZ, LC, LK, LR,
LS, LT, LU, LV, MA, MD, MG, MK, MN, MW, MX, MZ,
NO, NZ, PL, PT, RO, RU, SD, SE, SG, SI, SK, SL, TJ, TM,
TR, TT, TZ, UA, UG, US, UZ, VN, YU, ZA, ZW.

(84) Designated States (regional): ARIPO patent (GH, GM, 
KE, LS, MW, MZ, SD, SL, SZ, TZ, UG, ZW), Eurasian 

[Continued on next page] 



(54) Title: APPARATUS AND METHOD FOR STRUCTURE-BASED PREDICTION OF AMINO ACID SEQUENCES 






[Front-page drawing not reproduced. Legible labels: ref, E(p).]



(57) Abstract: The present invention provides methods and apparatus for analyzing a protein structure. The following steps are
carried out: receiving a reference structure for a protein, whereby the said reference structure forms a representation of a 3D structure
of the said protein, and the protein consists of a plurality of residue positions, each carrying a particular reference amino acid type
in a specific reference conformation, and the protein residues are classified into a set of modeled residue positions and a set of fixed
residues, the latter being included into a fixed template; substituting into the reference structure of step (A) a pattern whereby the
said pattern consists of one or more of the modeled residue positions defined in step (A), each carrying a particular amino acid residue
type in a fixed conformation, and the one or more amino acid residue types of the said pattern are replacing the corresponding amino
acid residue types present in the reference structure; optimizing the global conformation of the reference structure of step (A) being
substituted by the pattern of step (B), whereby a suitable protein structure optimization method based on a function allowing to assess
the quality of a global protein structure, or any part thereof, is used in combination with a suitable conformational search method
and the said structure optimization method is applied to all modeled residue positions defined in step (A) not being located at any
of the pattern residue positions defined in step (B), and the pattern and template residues are kept fixed; assessing the energetic
compatibility of the pattern defined in step (B) within the context of the reference structure defined in step (A) being structurally
optimized in step (C) with respect to the said pattern, by way of comparing the global energy of the substituted and optimized
protein structure with the global energy of the non-substituted reference structure; and storing a value reflecting the said energetic
compatibility of the pattern together with information related to the structure of the pattern in the form of an energetic compatibility
object.

APPARATUS AND METHOD FOR STRUCTURE-BASED PREDICTION OF AMINO ACID SEQUENCES

FIELD OF THE INVENTION 
The present invention relates to the field of genomics as well as to the field of
protein engineering, and in particular to an apparatus and method to determine the amino acid
sequence space accessible to a given protein structure and to use this information (i) to
align an amino acid sequence with a protein structure in order to identify a potential
structural relationship between said sequence and structure, which is known in the art as
fold recognition, and (ii) to generate new amino acid sequences which are compatible
with a protein structure, which is known in the art as protein design.

BACKGROUND OF THE INVENTION

Within the field of bioinformatics there is an emerging need for so-called data
mining tools. These tools aim at unraveling relationships that connect more elementary
data components such as oligonucleotide and amino acid sequences, single-nucleotide
polymorphisms (hereinafter referred to as SNP), expression profiles, biochemical
properties, structural and functional characterization, and so on. These relationships
represent knowledge that can be used to build so-called knowledge databases that offer
a structured description (e.g. via annotation mechanisms) of the more elementary data
components defined above. Knowledge databases will increasingly play a key
role in the biotechnology industry since they form an inference basis from which the
behavior of biological systems can be understood, predicted and possibly controlled via
proper biotechnological engineering.

The need for data mining tools is further increased by the availability of rich,
well annotated sequence databases which offer an in silico basis for new gene hunting
approaches. Here the determination of sequence-to-structure relationships plays a
crucial role, even for those who are not primarily interested in structural information.
Indeed, via sequence-to-structure relationships, distant similarities with other
proteins can often be found. Such similarities are unlikely to be found by direct sequence
comparisons if the sequence identity drops below the 25% level, according to
R.F. Doolittle in Of Urfs and Orfs. A primer on how to analyze derived amino acid
sequences (1986) at University Science Books, Mill Valley (California) and Sander et
al. in Proteins (1991) 9:56-68.

Conversely, structure-to-sequence relationships are also rapidly
gaining attention for several reasons. Firstly, the available experimental basis has
recently evolved towards a mature stage, since the number of known protein structures
has now reached over 12,000. Depending on the classification method, these structures
represent about 600 to 700 unique folds, i.e. about 60 to 70% of the estimated total
number of about 1000 possible folds adopted by naturally-occurring proteins, according
to C. Chothia in Nature (1992) 357:543-544. Secondly, the available technological basis
has also made a giant leap since new methods for side-chain positioning (according to
M. Vasquez in Curr. Opin. Struct. Biol. (1996) 6:217-221), loop modeling (according to
Sudarsanam et al. in Protein Sci. (1995) 4:1412-1420), inverse folding (according to
Bowie et al. in Science (1991) 253:164-170), protein design (according to Dahiyat et al.
in Science (1997) 278:82-87) and so on continuously appear in the literature, which
methods also benefit from computer devices becoming more and more powerful. Thirdly,
the scientific community has started recognizing the potential of structure-based protein
engineering, as can be inferred from the growing number of companies in this field and
the success of initiatives like the Critical Assessment of Techniques for Protein
Structure Prediction (CASP) contests (http://PredictionCenter.llnl.gov/) and the
structural genomics initiative (http://www.nigms.nih.gov/funding/psi.html).

Tools which are able to link a given amino acid sequence to the correct
three-dimensional (hereinafter referred to as 3D) fold or compute the amino acid sequences that
are compatible with a given protein structure are very valuable. Given the myriad of
sequences and structures known world-wide, distributed over different databases, the
crucial parameters determining the success of any such method are prediction
accuracy and computational speed.

The present invention relates to a novel approach to link one or more amino acid
sequences with one or more protein 3D structures and vice versa. More particularly, the
invention concerns a new method preserving the accuracy of the most accurate
atom-based modeling techniques currently known while avoiding the bottleneck problem
known as the "combinatorial substitution problem", thereby gaining several orders of
magnitude in computational speed.

The fundamental problem to be addressed by the present invention relates to
multiple combined substitutions in a protein and can be illustrated by the following
example. We shall consider a protein X for which the 3D structure has been
experimentally determined. This structure is called the wild-type (WT) structure and
the amino acid sequence of protein X is called the WT sequence. We shall further
assume that X is an enzyme used in an industrial process and that there is interest in
constructing a modified protein X' showing a considerably higher thermal stability than
the wild type X, while preserving the latter's functionality. While such problems have
often been addressed by introducing point mutations (i.e. the substitution of single
amino acid residues at selected positions), it has been shown that the simultaneous
mutation of multiple amino acid residues is usually more efficient, according to
International Application Publication No. WO 98/47089. On the other hand the latter
approach, known as protein design, is extremely complicated in view of the extremely
high number of a priori possible combinations of single residue mutations. For example,
if the aforementioned task of thermally stabilizing protein X was limited to substitutions
in the core of a protein (i.e. the internal part of a protein where strictly buried residues
are located) containing e.g. 15 residues, and if only 10 (preferably apolar) residue types
were considered, there would still be 10^15 a priori possible solutions. Each such solution
represents an amino acid sequence which is different from that of protein X and is, in
principle, a candidate sequence for protein X'. In order to predict which sequences are
more stable than the WT sequence, one could apply a fast molecular modeling program,
predict the 3D-structure of each sequence, compute the potential or free energy of each
modeled structure and store the best results. However, clearly no modeling program is
at present fast enough to complete this task in a reasonable period of time.
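
The scale of this example can be checked with a few lines of arithmetic. The sketch below is only illustrative: the throughput of one million modeled sequences per second is a hypothetical figure assumed for the calculation, not a value taken from this description.

```python
# Size of the sequence space for the core-redesign example above.
n_positions = 15   # buried core positions considered
n_types = 10       # residue types allowed at each position

n_sequences = n_types ** n_positions   # 10**15 candidate sequences

# Hypothetical modeling throughput, assumed purely for illustration.
sequences_per_second = 1_000_000
seconds_per_year = 365 * 24 * 3600

years_needed = n_sequences / sequences_per_second / seconds_per_year
print(f"{n_sequences:.0e} sequences, roughly {years_needed:.0f} years of computing")
```

Even at this optimistic assumed rate, exhaustive enumeration would take about three decades, which is why a full enumeration is not a practical option.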

The above example illustrates the main bottleneck problem in protein design,
which is known in the art as the "combinatorial substitution problem" and which arises
from the fact that different individual substitutions in a protein are, in principle, not
independent, as shown in FIG. 1a. The energy of a protein comprising a double
substitution may not be simply calculated from the structures of two single mutants,
because both mutated amino acid residues are "unaware" of each other in the single
mutants, i.e. the single mutant structures lack any information about the interaction
between both substituted amino acid residues in the double mutant. Consequently the
simultaneous introduction of two or more substitutions into the protein to be modeled
may lead to either favorable or unfavorable interactions which cannot be derived from
the structure of single mutants. When performing such multiple substitutions, some
information about the pair-wise interactions between the removed WT residues and the
introduced mutant residues should therefore be taken into account.

On the other hand, there are also cases where combinatorial effects do not play a
significant role, i.e. where single substitutions are independent from each other and
where individual substitution energies are additive, as shown in FIG. 1b. Moreover, there
is also experimental evidence, at least in some cases, that the energetic contributions of
different individual mutations are additive, according to A. Fersht in Enzyme Structure
and Mechanism (1985), ISBN 0-7167-1614-3. This is so for example when two
individual substitutions are separated in space in a manner such that the pair-wise
interaction between both residues is negligibly small for both the WT residues and the
substituted residues. However this no-interaction condition is not sufficient to conclude
that two substitutions may be considered independently. Indeed, although two residues
do not directly interact with each other, they may still be coupled through
conformational adaptation of residues in their vicinity, as shown in FIG. 1c. In general,
this type of coupling depends on the "plasticity" of the protein (plasticity: the ability to
cushion conformational changes). Indirect, conformational coupling may for instance
occur when two simultaneously substituted residues leave an intermediate residue in a
situation where no conformational solution exists which leads to favorable energetic
interactions with both mutated residues. Such an intermediate residue could be
considered as "non-plastic" with respect to the concerned pair of substituted residues.
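
The distinction between additive and coupled substitutions can be expressed as a single number: the difference between the energy change of the double mutant and the sum of the energy changes of the two single mutants. The following is a minimal sketch of that bookkeeping; the function names, the tolerance and the energy values are hypothetical and chosen only for illustration.

```python
def coupling_energy(e_wt, e_a, e_b, e_ab):
    """Non-additive part of a double substitution.

    All arguments are global energies, in the same units, of the wild type,
    the two single mutants (A and B) and the double mutant (AB).
    """
    d_a = e_a - e_wt     # effect of substitution A alone
    d_b = e_b - e_wt     # effect of substitution B alone
    d_ab = e_ab - e_wt   # effect of both substitutions together
    return d_ab - (d_a + d_b)


def is_additive(e_wt, e_a, e_b, e_ab, tol=0.1):
    """True when the two substitutions behave independently within tol."""
    return abs(coupling_energy(e_wt, e_a, e_b, e_ab)) <= tol


# Illustrative (invented) energies.
independent = coupling_energy(-100.0, -102.0, -101.5, -103.5)   # no coupling
coupled = coupling_energy(-100.0, -102.0, -101.5, -99.0)        # unfavorable coupling
print(independent, coupled)
```

A non-zero coupling term can arise either from direct interaction between the two substituted residues or from the indirect, conformational coupling described above; this bookkeeping alone does not distinguish the two.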
Going back to the previous example of the enzyme X needing to be stabilized,
this means that the modeling task cannot be performed by simply proposing a molecule
X' comprising a set of substitutions which have each been modeled independently in the
WT structure. On the other hand, a mere combinatorial enumeration of all possible
global structures is extremely inefficient from a computational perspective and can be
considered as undue work with respect to the primary goal of predicting an enzyme X'
having an optimal thermal stability.

Several authors have nevertheless tackled the side-chain positioning problem in proteins
by means of a combinatorial approach or an equivalent method. For instance, the
Dead-End Elimination (DEE) method takes into account both side-chain/template and
side-chain/side-chain interactions and uses a mathematically rigorous criterion to eliminate
side-chain rotamers which do not belong to the Global Minimum Energy Conformation
(GMEC). Since the DEE routines usually do not converge towards a unique structure, a
combinatorial end stage routine is needed to determine the GMEC. It was thus
demonstrated, namely by Gordon et al. in Fold. Des. (1999) 7:1089-1098 and
Wernisch et al. in J. Mol. Biol. (2000) 301:713-736, that the combinatorial problem for
multiple substitutions can be alleviated, by means of DEE, by eliminating a number of
potential substitutions which cannot be part of the globally best solution, i.e. the most
stable protein molecule. Since eliminations occur at the single or pair residue level, they
result in a huge reduction of the search space, because all combinations containing an
eliminated element are eliminated as well. The DEE method is therefore a valuable method to reduce the
complexity of a system. Its main disadvantages, however, are (1) non-convergence
towards a unique structure, as mentioned above, (2) the inability to predict
"second best" solutions and (3) a relatively weak performance for large systems (i.e.
having more than about 30 residues), due to a mathematically stringent process which
complicates the use of approximations.
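
The elimination criterion is not spelled out in this passage. As background, the sketch below uses the original singles criterion of the DEE literature: a rotamer r at position i can be discarded if its best-case energy (self energy plus the minimum pair energy with every other position) is still higher than the worst-case energy of some alternative rotamer t at the same position. The data layout and the toy energies are assumptions made for illustration.

```python
def dee_singles(self_e, pair_e):
    """One pass of the singles Dead-End Elimination criterion.

    self_e[i][r]           : rotamer/template energy of rotamer r at position i
    pair_e[(i, j)][(r, s)] : pair energy of rotamer r at i and s at j, with i < j
    Returns the set of (position, rotamer) entries that cannot be in the GMEC.
    """
    def pair(i, r, j, s):
        # Pair energies are stored once per position pair, with i < j.
        return pair_e[(i, j)][(r, s)] if i < j else pair_e[(j, i)][(s, r)]

    eliminated = set()
    for i in self_e:
        others = [j for j in self_e if j != i]
        for r in self_e[i]:
            # Best case for r: most favorable partner at every other position.
            best_r = self_e[i][r] + sum(
                min(pair(i, r, j, s) for s in self_e[j]) for j in others)
            for t in self_e[i]:
                if t == r:
                    continue
                # Worst case for t: least favorable partner at every other position.
                worst_t = self_e[i][t] + sum(
                    max(pair(i, t, j, s) for s in self_e[j]) for j in others)
                if best_r > worst_t:
                    eliminated.add((i, r))
                    break
    return eliminated


# Toy system: two positions with two rotamers each (invented energies).
self_e = {0: {"a": 0.0, "b": 5.0}, 1: {"a": 0.0, "b": 1.0}}
pair_e = {(0, 1): {("a", "a"): 0.0, ("a", "b"): 1.0,
                   ("b", "a"): 0.5, ("b", "b"): 0.0}}
dead = dee_singles(self_e, pair_e)
print(sorted(dead))
```

In this toy case a single pass leaves one rotamer per position, so no combinatorial end stage is required; in realistic systems several rotamers usually survive, which is the non-convergence issue noted above.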

One problem to be addressed by the present invention is to develop a method
that rapidly and accurately predicts the compatibility of one or more amino acid sequences
with one or more protein 3D-structures. The requirement of accuracy strongly argues for
the use of a detailed, atom-based force field (energy function) to compute the local and
other interactions of a set of amino acid residues placed onto a 3D-structure.
Furthermore, the above example showed that a correct positioning, using said
interaction energies, is severely hampered by the combinatorial problem. On the other
hand, all current methods which fully account for combinatorial effects are in conflict
with the second requirement of a high computational speed. This constitutes the major
dilemma in protein modeling, more specifically in fold recognition and protein design.

In order to determine sequences that are reachable by a given protein structure 
and/or to identify the best-matching protein 3D-structure, given a particular sequence, 
one may use (1) prediction algorithms relying on statistically derived substitution 
matrices, (2) prediction algorithms based on statistical descriptors, and (3) computation 
of explicit conformational energies, corrected for solvation effects. 

In the approach using descriptors, a set of reference proteins (often referred to as
the learning set) is used to derive properties that can statistically be used to judge the
propensity for any given amino acid type to be present in some given structural
environment. For example, in the original derivation of the 3D-profiles of Bowie et al.
in Science (1991) 253:164-9, the environment of each residue is featured by (a) a buried
side chain area, (b) a local secondary structure and (c) the polarity of the environment.
Using these features, the authors recognized 18 environmental classes. As a
result, any given protein structure can be transformed into a linear string composed of
the class identifiers for each position along the protein's sequence. From a database of
proteins of known structure, a so-called 3D-to-1D scoring table is built to assign a score
for finding each of the 20 naturally-occurring amino acid types associated with each of
the environmental classes. Using this scoring table, the best sequence alignment can
then be computed against the string of environment classes for a number of proteins.
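
The use of such a scoring table can be sketched as a simple lookup. The example below scores an ungapped, equal-length alignment only, whereas the published approach computes a best alignment with gaps; the class names and score values are invented for illustration and are not the 18 classes or the scores of the cited work.

```python
# Hypothetical 3D-to-1D scoring table: table[environment_class][amino_acid].
# Class names and numbers are invented for illustration only.
score_table = {
    "buried_helix": {"L": 1.2, "A": 0.6, "K": -1.5, "G": -0.8},
    "exposed_loop": {"L": -0.9, "A": -0.1, "K": 0.7, "G": 1.0},
}


def profile_score(env_string, sequence, table, default=0.0):
    """Sum the per-position scores of a sequence laid over an environment string.

    Assumes an ungapped alignment, so both inputs must have the same length.
    """
    if len(env_string) != len(sequence):
        raise ValueError("environment string and sequence must have equal length")
    return sum(table[env].get(aa, default)
               for env, aa in zip(env_string, sequence))


envs = ["buried_helix", "buried_helix", "exposed_loop"]
good = profile_score(envs, "LAG", score_table)   # apolar core, glycine in loop
poor = profile_score(envs, "KKL", score_table)   # charged core, leucine in loop
print(good, poor)
```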

This procedure initiated the so-called threading approach in inverse folding,
where the compatibility of candidate sequences with a given scaffold is examined. As
can be anticipated from the discretization procedure, wherein the complexity of a 3D
environment is mapped onto a limited number of classes, the success of the threading
approach is significantly limited, according to M.J. Sippl in J. Comp.-Aided Mol. Design
(1993) 7:473-501 and Chiu et al. in Protein Eng. (1998) 11:749-752.

Global sequence signatures involving many residues may be detected but, in
view of its statistical nature, the threading method cannot accurately
assess sequence-structure compatibility in cases involving e.g. perturbation effects wherein a
given WT sequence is varied at a number of positions. In addition, even if a sequence is
correctly predicted to be compatible with a given structure, the threading method itself
is incapable of deriving an accurate 3D-structure of the sequence laid over a given
protein backbone structure.

In order to achieve accurate predictions, it is desirable to take into account the
full complexity of the 3D environment of each amino acid residue. In the third approach,
wherein predictions are based on computational methods assessing the energetics of a
system of protein residues, it is possible to compute the conformation of lowest energy
and even the sequence of lowest energy for a system consisting of a protein backbone
and a set of amino acid residues which are allowed to vary in conformation and/or
amino acid type. Most often, such computations are based on advanced search
techniques combined with a detailed model for atomic interactions. In addition to the
bonded and non-bonded interaction terms as used in molecular mechanics force fields
(e.g. the CHARMM force field of Brooks et al. in J. Comput. Chem. (1983) 4:187-217),
these energy computations may also assess the hydrophobic effect through a suitable
solvation potential such as e.g. the pair-wise solvation potential disclosed by Street et al.
in Fold. Des. (1998) 3:253-258, thereby correcting for the modeling in absence of
explicit water molecules.

In order to determine whether a sequence is compatible with a given protein
3D-structure or scaffold, one could in principle explore different alignments of the query
sequence relative to the given protein backbone and search for alignments yielding low
conformational energy, if any. However such a process suffers from several
drawbacks. First, this process is very time consuming, especially if the given query
sequence has to be scanned against a large database of protein structures, thereby
searching for the best scaffold that is likely to match with the given query sequence.
Indeed, for each protein a series of detailed computations needs to be carried out at the
atomic level: several alignments need to be explored and each time the conformation or
sequence of lowest energy must be identified, which involves the solution of the
so-called combinatorial problem whereby the best solution must be found out of all
possible combinations of single-residue states. Consequently it is unlikely that such
computations can be carried out in a high-throughput mode. Second, the alignments
must allow for gaps, which are often located in the loop regions. However, for
evolutionarily distant proteins the precise gap locations and lengths are a priori unclear,
which in turn necessitates multiple trial runs to match one sequence with one structure.

In addition, some important problems simply cannot be answered by this third
approach, such as for instance the task of computing all possible amino acid sequences
that are compatible with a given protein scaffold (below a pre-set threshold, using a
suitable scoring function), because it is at present impossible to generate all possible
amino acid sequences and assess their compatibility within a reasonable computation
time.

In summary, there is a stringent need for a fundamentally new method in
order to explore the sequence space that is compatible with a given scaffold and to
rapidly scan a given query sequence for structural compatibility against a database of
known protein structures.

GLOSSARY

Throughout the foregoing and following description of the invention, the
abbreviations, acronyms and technical terms used shall have the following meanings
and definitions:

ALA, ARG, ASN, ASP, CYS, GLN, GLU, GLY, HIS, ILE, LEU, LYS, MET, PHE,
PRO, SER, THR, TRP, TYR, VAL: standard 3-letter code for the naturally-occurring
amino acids (or residues) named, respectively: alanine, arginine, asparagine,
aspartic acid, cysteine, glutamine, glutamic acid, glycine, histidine, isoleucine, leucine,
lysine, methionine, phenylalanine, proline, serine, threonine, tryptophan, tyrosine,
valine.

ASA: Accessible Surface Area
DEE: Dead-End Elimination
GMEC: Global Minimum Energy Conformation
PDB: Protein Data Bank
RMSD: Root-Mean-Square Deviation

Accessible Surface Area (ASA)

The surface area of (part of) a molecular structure, traced out by the mid-point of
a probe sphere of radius ~1.4 Å rolling over the surface.

Amino acid, amino acid residue

Group of covalently bonded atoms for which the chemical formula can be written
as ⁺H₃N-CHR-COO⁻, where R is the side group of atoms characterizing the
amino acid type. The central carbon atom in the main-chain portion of an amino
acid is called the alpha-carbon (Cα). The first carbon atom of a side chain,
covalently attached to the alpha-carbon, if any, is called the beta-carbon (Cβ). An
amino acid residue is an amino acid, comprised in a chain composed of multiple
residues and which can be written as -HN-CHR-CO-.

Backbone

See: Main chain

Conformational space

The multidimensional space formed by considering (the combination of) all
possible conformational degrees of freedom of a protein, where one degree of
freedom usually refers to the torsional rotation around a single covalent bond. In a
broader sense, it also includes possible sequence variation by extending the
definition of a rotamer to a particular amino acid type in a discrete conformation.

Dead-End Elimination (DEE)

A method to reduce the conformational space of a side-chain modeling task by
eliminating individual side-chain rotamers which are incompatible with the
system on the basis of one or more rigorous mathematical criteria. By a variant of
this method, pairs of rotamers belonging to different residues can also be
eliminated.

Energy function, force field

The function, depending on the atomic positions and chemical bonds of a
molecule, which makes it possible to obtain a quantitative estimate of the latter's potential or
free energy.

Flexible side chain, rotatable side chain, modeled side chain

A side chain modeled during performance of the method.

Global energy

The total energy of a global structure (e.g. the GMEC) which may be calculated in
accordance with the applied energy function.

Global minimum 

The globally optimal solution of a protein consisting of a number of elements (e.g. 
residue positions) which are a priori susceptible to variation (e.g. using the 
concept of rotamers). 

Global Minimum Energy Conformation, GMEC

The global conformation of a protein which has the lowest possible global energy
according to the applied energy function, rotamer library and template structure.

Global structure, global conformation

The molecular structure of a uniquely defined protein, e.g. represented as a set of
cartesian co-ordinates for all atoms of the protein.

Homology modeling

The prediction of the molecular structure of a protein on the basis of a template
structure.

Main chain, backbone

The part of a protein consisting of one or more near-linear chains of covalently
bonded atoms, which can be written as (-NH-CH(-)-CO-)n. Herein, the Cβ-atom
is considered as a pseudo-backbone atom and is included in the template because
its atomic position is unambiguously defined by the positions of the true backbone
atoms.

Protein Data Bank (PDB)

The single international repository for the processing and distribution of 3-D
macromolecular structure data primarily determined experimentally by X-ray
crystallography and nuclear magnetic resonance.

Protein design 

That part of molecular modelling which aims at creating protein molecules with 
altered/improved properties like stability, immunogenicity, solubility and so on. 

Root-Mean-Square Deviation, RMSD

The square root of the average quadratic deviation between the atomic positions,
stored in two sets of co-ordinates, for a collection of atoms.

Rotamer, rotameric state

An unambiguous discrete structural description of an amino acid side chain,
whether preferred or not. According to this definition, which is extended over the
usual definitions by considering less preferred states, it is not required that
rotamers be interchangeable via rotation only.

Rotamer library 

A table or file comprising the necessary and sufficient data to define a collection 
of rotamers for different amino acid types. 

Rotamer/template (interaction) energy

The sum of all atomic interactions, as calculated using the applied energy
function, both within a given side-chain rotamer at a specific position in a protein
structure and between the rotamer and the template.

Sequence space

The multidimensional space formed by considering the combination of all (or a
set of) possible amino acid substitutions at all (or a set of) residue positions in a
protein.

Side chain, side group

Group of covalently bonded atoms, attached to the main chain portion of an amino
acid (residue).

Side-chain/side-chain pair(wise) (interaction) energy

The sum of all atomic interactions, as calculated using the applied energy
function, between two side-chain rotamers at two specific positions in a protein
structure.

Steepest descent energy minimization

Simple energy minimization method, belonging to the class of gradient methods,
only requiring the evaluation, performed in accordance with a particular energy
function, of the potential energy and the forces (gradient) of a molecular structure.

Template 

All atoms in a modelled protein structure, except those of the rotatable side 
chains. 
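
Some of the glossary terms reduce to short formulas. As an example, the RMSD between two co-ordinate sets can be computed as sketched below. The sketch assumes that both sets list the same atoms in the same order and that the structures are already superimposed; no optimal superposition is performed, and the function name is illustrative.

```python
import math


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two equally ordered sets of 3D points."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("co-ordinate sets must be non-empty and of equal length")
    squared = sum((xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
                  for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b))
    return math.sqrt(squared / len(coords_a))


# Two atoms; the second one is displaced by 2 units along z.
reference = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
model = [(0.0, 0.0, 0.0), (1.0, 0.0, 2.0)]
value = rmsd(reference, model)
print(value)
```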

SUMMARY OF THE INVENTION

The present invention provides a totally new approach to handle
sequence-to-structure and structure-to-sequence relationships:

(i) in a first aspect, the present invention comprises the systematic analysis of
known protein structures in terms of individual contributions of single residues, pairs and
occasionally also multiplets of amino acid residues to the global energy of a
protein comprising any of these residues, and

(ii) in a second aspect, the present invention comprises a method of combining the
single, pair and multiplet contributions in order to estimate the global energy of
a protein structure comprising a larger number of substitutions.

A main feature of the present invention is that energetic values for single residues, pairs
and multiplets of residues in different conformations can be pre-computed once and
stored on an appropriate data storage device for all, or a subset of all, known
protein 3D-structures. These data can afterwards be retrieved from the storage device
when needed for the calculation of global energies. The same data elements can be used
over and over again in various experiments wherein the compatibility of different sequences
with the same 3D-structure is investigated. The present invention makes it possible to (i) produce
data in a fast and consistent way and (ii) use the produced data to calculate accurate
global energies of proteins which have undergone a relatively large number of
mutations compared to the original structure.
A central idea of the present invention is to analyze known protein structures by 
10 computing a quantitative measure reflecting the energetic compatibility of all naturally- 
occurring and synthetic amino acids of interest at each residue position of interest in the 
structure. In a similar way, the energetic compatibility of selected residue pairs and 
multiplets may be computed as well. Upon generating such compatibility data, it is 
possible to take into account conformational adaptation effects for residues surrounding 
15 the considered substituted residue(s), as if the same substitutions were introduced in 
vitro by protein engineering techniques. 

The said energetic compatibility data may be generated systematically for 
different protein structures and stored concisely using appropriate data storage devices. 
Ideally, it should be sufficient to use only pre-computed data for single residues and/or 
20 residue pairs in order to derive the compatibility of any sequence with a given protein 
structure with sufficient accuracy. In such case, it would become possible to derive 
extended lists of structure-compatible sequences as well as sequence-compatible 
structures arithmetically instead of by means of sophisticated protein modeling. 
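
The idea of deriving compatibility "arithmetically" from stored values can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the dictionary layout, the residue types, the energy values, and in particular the combination rule (single ECO energies summed, plus a non-additivity correction for each stored pair). The text above does not prescribe this formula.

```python
from itertools import combinations

# Hypothetical store of pre-computed ECO energies (invented values).
# Single keys are (position, residue_type); pair keys are two such tuples, sorted.
single_eco = {(5, "LEU"): -1.2, (30, "PHE"): 0.8, (52, "TRP"): 2.5}
pair_eco = {((5, "LEU"), (30, "PHE")): -0.1}  # ECO energy of the double substitution


def estimate_substitution_energy(substitutions, single_eco, pair_eco):
    """Estimate the energy of a multiple substitution from stored values only.

    Assumed rule (not taken from the source): sum the single energies, then add
    E_pair - E_single(a) - E_single(b) for every pair found in the store.
    Pairs absent from the store are treated as independent (zero correction).
    """
    subs = sorted(substitutions)
    total = sum(single_eco[s] for s in subs)
    for a, b in combinations(subs, 2):
        if (a, b) in pair_eco:
            total += pair_eco[(a, b)] - single_eco[a] - single_eco[b]
    return total


estimate = estimate_substitution_energy(
    [(5, "LEU"), (30, "PHE"), (52, "TRP")], single_eco, pair_eco
)
```

No structure modeling is performed at lookup time; the cost is a handful of dictionary reads and additions per candidate sequence.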

The present invention thus first provides a method for analyzing a protein structure, 
the method being executable in a computer under the control of a program stored in the
computer, and comprising the following steps: 

A. receiving a reference structure for a protein, whereby

- the said reference structure forms a representation of a 3D structure of the said
protein, and

- the protein consists of a plurality of residue positions, each carrying a particular
reference amino acid type in a specific reference conformation, and

- the protein residues are classified into a set of modeled residue positions and a set
of fixed residues, the latter being included into a fixed template;

B. substituting into the reference structure of step (A) a pattern, whereby

- the said pattern consists of one or more of the modeled residue positions defined
in step (A), each carrying a particular amino acid residue type in a fixed
conformation, and

- the one or more amino acid residue types of the said pattern replace the
corresponding amino acid residue types present in the reference structure;

C. optimizing the global conformation of the reference structure of step (A) being
substituted by the pattern of step (B), whereby

- a suitable protein structure optimization method, based on a function that allows the
quality of a global protein structure, or any part thereof, to be assessed, is used in
combination with a suitable conformational search method, and

- the said structure optimization method is applied to all modeled residue positions
defined in step (A) that are not located at any of the pattern residue positions
defined in step (B), and

- the pattern and template residues are kept fixed;

D. assessing the energetic compatibility of the pattern defined in step (B) within the
context of the reference structure defined in step (A) being structurally optimized
in step (C) with respect to the said pattern, by way of comparing the global
energy of the substituted and optimized protein structure with the global energy
of the non-substituted reference structure; and

E. storing a value reflecting the said energetic compatibility of the pattern together
with information related to the structure of the pattern in the form of an energetic
compatibility object.
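
Steps (A) to (E) can be sketched on a toy system. The sketch below is not the claimed implementation: it assumes a pairwise-decomposable energy supplied as two lookup tables, uses exhaustive enumeration as the "conformational search method", takes the reference energy as the optimum of the unsubstituted system, and defines the stored value as E(p) minus E_ref. All names, residue types and energies are hypothetical.

```python
from itertools import product


def total_energy(assignment, self_e, pair_e):
    """Energy of a full assignment {position: (residue_type, rotamer)}."""
    positions = sorted(assignment)
    energy = sum(self_e[(p, assignment[p])] for p in positions)
    for index, p in enumerate(positions):
        for q in positions[index + 1:]:
            # Pairs missing from the table are assumed not to interact.
            energy += pair_e.get(((p, assignment[p]), (q, assignment[q])), 0.0)
    return energy


def adapt(pattern, options, self_e, pair_e):
    """Step C: exhaustive search over modeled positions outside the pattern."""
    free = [p for p in sorted(options) if p not in pattern]
    best = None
    for combo in product(*(options[p] for p in free)):
        assignment = dict(pattern)  # pattern residues are kept fixed
        assignment.update(zip(free, combo))
        energy = total_energy(assignment, self_e, pair_e)
        if best is None or energy < best[0]:
            best = (energy, assignment)
    return best


def generate_eco(pattern, options, self_e, pair_e, e_ref):
    """Steps B to E: substitute, adapt, compare with the reference, build a record."""
    e_pattern, adapted = adapt(pattern, options, self_e, pair_e)
    return {"pattern": dict(pattern), "energy": e_pattern - e_ref, "adapted": adapted}


# Step A (toy reference): position 1 is Ala, position 2 is Leu with two rotamers.
options = {1: [("ALA", 0)], 2: [("LEU", 0), ("LEU", 1)]}
self_e = {(1, ("ALA", 0)): 0.0, (1, ("PHE", 0)): -1.0,
          (2, ("LEU", 0)): 0.0, (2, ("LEU", 1)): 0.5}
pair_e = {((1, ("PHE", 0)), (2, ("LEU", 0))): 5.0,   # steric clash
          ((1, ("PHE", 0)), (2, ("LEU", 1))): -0.5}  # favorable contact

e_ref, _ = adapt({}, options, self_e, pair_e)
eco = generate_eco({1: ("PHE", 0)}, options, self_e, pair_e, e_ref)
```

In this toy case the Phe pattern at position 1 clashes with the reference rotamer of Leu 2, and the adaptation step resolves the clash by switching Leu 2 to its second rotamer.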

By "fixed" in step (A), it is meant that the template structure may be varied by at 
most about 1 Angstrom in RMSD compared to the original structure.

For the purpose of the rotameric embodiment of the method, such as explained 
hereinafter, the selected rotamer library may be either backbone-dependent or 
backbone-independent. A backbone-dependent rotamer library allows different rotamers 
depending on the position of the residue in the backbone, e.g. certain leucine rotamers 
are allowed if the position is within an α helix whereas other leucine rotamers are
allowed if the position is not in an α helix. A backbone-independent rotamer library uses
all rotamers of an amino acid at every position, which is usually preferred when
considering core residues, since flexibility in the core is important. However,
backbone-independent libraries are computationally more expensive and thus less preferred for
surface and boundary positions. The rotamer library may originate from an analysis of
known protein structures or it may be generated by computational means, either on the
basis of model peptides or specifically for the amino acids considered at each of said
residue positions of said protein backbone.

For the purpose of step (C), the best possible global conformation of the protein 
system includes any conformation that can be appreciated by a person skilled in the art 
as being closely related to the optimum. For the purpose of step (E), storage of the data 
may be either explicit or implicit, the latter being accomplished e.g. by advanced data
ordering or by conventional compression methods.

Other aspects of the present invention include: 

- a data structure comprising at least an ECO obtainable by the above-defined
method, for instance in the form of a protein sequence which is different from any
known protein sequence;

- a nucleic acid sequence encoding a protein sequence analyzed by the method as
above-defined;

- an expression vector or a host cell comprising the nucleic acid sequence as
above-defined;

- a pharmaceutical composition comprising a therapeutically effective amount of a
protein sequence generated by the method (as above-defined) and a
pharmaceutically acceptable carrier;

- a method of treating a disease in a mammal, comprising administering to said
mammal a pharmaceutical composition such as above-defined;

- a database comprising a set of ECO's in the form of a data structure, e.g. stored on a
computer readable medium;

- a computing device adapted to carry out any of the methods of the present invention;

- a computer readable data carrier having stored thereon a computer program capable
of carrying out any of the methods of the present invention.

The present invention, its characteristics and embodiments, will now be 
described with reference to the following detailed description.

BRIEF DESCRIPTION OF THE DRAWINGS

Figure 1 shows three schematic representations of part of protein structures in
their conformations of lowest energy.

Figure 2 shows a schematic representation of the mechanism of pattern
introduction (PI) and structure adaptation (SA) according to the present invention.

Figure 3 illustrates the principle of additivity of adaptation (AA) according to 
the present invention. 

Figure 4 shows the result of the alignment procedure of human
alpha-lactalbumin against a single residue GECO profile for hen lysozyme, according to the
present invention.

Figure 5 shows a correlation diagram between the energy obtained from the
effective modeling of 100 different triple mutations at residues 5, 30 and 52 of the B1
domain of protein G and the energy calculated for the same substitutions derived on the
basis of GECO energies, according to the present invention.

Figure 6 shows a computer system suitable for use with the present invention.

DETAILED DESCRIPTION OF THE INVENTION

As previously indicated, the present invention consists of two main aspects: first, a
new method including the ECO concept and, secondly, the use of such a method to assess the
energy of global multiply substituted protein systems.

The first aspect of the present invention will be explained in particular with reference to
Figures 1 to 3, which are now described in detail.

In Figure 1, the bold line symbolizes the backbone, whereas thin lines show
residue side chains in their energetically optimal conformation. Residue positions are
numbered from 1 to 5. Different residue types are represented by lines of different
length. Black lines are used for the WT structure and grey lines show substituted
residues. Dotted lines show residues before substitution or conformational adaptation. In
Figure 1a, residues 1 and 5 have been substituted by larger residue types. While each
single substitution would be compatible with the remainder of the structure, the double
mutation leads to an incompatibility (symbolized by the dotted circle); thus
the two mutations cannot be modeled independently. Conversely, Figure 1b shows a
situation wherein two mutations are assumed to be independent: the single substitutions
at residues 1 and 4, as well as the double mutation, do not change their environment and
the two residues do not significantly interact with each other. Figure 1c shows that two
residues which do not directly interact may still be coupled through conformational
adaptation: the larger residue type at residue position 2 necessitates (dotted circle at left)
a conformational change (indicated by an arrow) of residue 3 which is in conflict
with residue 4 (dotted circle at right).

In Figure 2, the same notations are used as in Figure 1. The left-most drawing
represents the protein of interest in its reference structure (ref). The middle drawing
shows the reference structure wherein a single residue pattern p is introduced at residue
position 1. The dotted circle illustrates a steric conflict with residue 3. After the SA
step (right-most drawing), residues 3, 4 and 5 have rearranged from their reference state
(dotted lines) to a new, adapted conformation (full lines) which overcomes the said
steric conflict. The energy of each of these structures is represented underneath each
drawing, showing that the introduction of p into the reference structure first leads to a
high-energy "intermediate" structure which can be relaxed towards a structure of lower
energy. The difference between the energy of the reference structure, $E_{ref}$, and that of
the substituted and adapted structure, $E(p)$, is the ECO energy, $E_{ECO}(p)$, which can be
considered as a substitution energy.

In Figure 3, the same notations are used as in Figure 1. Figure 3a shows an
example wherein the AA principle is met, i.e. the same conformational adaptations
(symbolized by the arrows) are observed for the double substitution (at right) as are
occurring for both single substitutions (at left). Figure 3b shows an example wherein the
AA principle is not met.

Before describing the ECO concept and its possible applications, the following
prerequisites should be explained:

1. a definition of applicable protein structures (see STRUCTURE DEFINITION);

2. a definition of conformational freedom (see CONFORMATIONAL FREEDOM);

3. a description of applicable energy models (see ENERGY FUNCTION);

4. a description of preferred modeling techniques which are able to generate
an ECO (see STRUCTURE OPTIMIZATION); and

5. a definition of a reference structure, its relevance and some preferred options
(see REFERENCE STRUCTURE).

Next, the ECO concept is discussed in detail, both descriptively (see
DESCRIPTION OF THE ECO CONCEPT) and mathematically (see FORMAL
DESCRIPTION OF ECO). Then, a preferred approach to systematically derive ECO's is
discussed (see SYSTEMATICAL PATTERN ANALYSIS (SPA) MODE). Next, some
preferred operations on basic ECO's are elaborated (see GROUPED ECO'S). Finally, a
preferred method to estimate the energy of multiply substituted protein structures
from single and double ECO and/or GECO energies is described (see ASSESSING
GLOBAL ENERGIES FROM SINGLE AND DOUBLE (G)ECO's).



STRUCTURE DEFINITION

The present method to generate ECO's is applicable to protein molecules for which 
a 3D representation (or model or structure) is available. More specifically, the atomic 
coordinates of at least part of a protein main chain, or backbone, must be accessible. 
Preferably a full main chain structure and, more preferably, a full protein structure is
available. In the event that the available protein structure is defective, some
conventional modeling tools (e.g. the so-called spare parts approach disclosed by
Claessens et al. in Prot. Eng. (1989) 2:335-345 and implemented e.g. by Delhaise et al.
in J. Mol. Graph. (1984) 2:103-106, or other modeling techniques such as reviewed by
Summers et al. in Methods in Enzymology (1991) 202:156-204) may be applied to
build missing atoms or loops. Depending on the force field used (see ENERGY
FUNCTION), the coordinates of at least the heavy atoms (any atom other than hydrogen),
or heavy atoms together with polar hydrogen atoms, or all atoms, of at least a region 
around the amino acid residues of interest must be known, where the said region should 
be sufficiently large to enable accurate energetic computations. Optionally ligands, such 
as co-factors, ions, water molecules and so on, or parts thereof may be included in the 
definition of the protein structure as well. A known structure for the side chains of a 
protein of interest is not an absolute requirement since the method of the present 
invention includes a side-chain placement step. However, knowing starting coordinates 
for the said side chains may be useful when the user should decide to energy-minimize 
the starting structure or to keep some side chains in a fixed conformation. 

The origin of the 3D protein structure is relatively unimportant within the context 
of the present invention since the ECO concept remains applicable to any 3D-structure 
conforming to aforementioned criteria. Yet, the meaningfulness of generated ECO's 
(and, therefore, the usefulness of applications derived therefrom) is directly proportional 
to the quality (accuracy) of the starting structure. Therefore, a number of possible 
sources of atomic coordinates, in decreasing order of preference, are provided 
hereinafter: X-ray determined structures, NMR-determined structures, homology 
modeled structures, structures resulting from de novo structure prediction.

In the preferred embodiment of the present invention, some atoms, generally
referred to as "template atoms", may be kept fixed throughout the process of generating
ECO's. Preferably, a template consists of all main-chain atoms, Cβ-atoms and ligand
atoms. A useful but more delicate option is to include particular side chains into the
template, preferably side chains lying far enough (as determined by a skilled user) from
the residue(s) involved in the ECO generation.

The starting 3D-structure of a protein of interest forms a first source of input data 
for the performance of the present method. 

In the definition of a protein of interest, there are no limits to the length of the chain
(provided that at least two amino acids are linked together) or to the number of chains.
The definition of a protein as used herein also includes domains (e.g. functional,
enzymatic or binding domains) as well as smaller fragments such as turns or loops. By
extension, the method may also be applied to molecules such as peptides,
peptido-mimetic structures and peptoids.

With respect to the second aspect of the present invention, i.e. a method to assess
the energetic fitness of multiply substituted proteins on the basis of values for single
residues and pairs, a starting structure is of lesser importance. Indeed, the latter method
preferably uses pre-computed values from a database, rather than generating them when
needed, although the latter possibility is not excluded by the present invention.


CONFORMATIONAL FREEDOM 

The present method involves two different types of structure changes, i.e. amino 
acid substitutions and conformational changes, the latter being discussed in this section. 
The present method to generate ECO's can be executed by allowing conformational 
changes either (i) in a discrete rotameric space or (ii) in a continuous space. The
rotameric approach, being preferred, will be described first.

The concept of rotamers states that protein side chains (and even residue main
chains) can adopt only a limited number of energetically favorable conformational
states, called rotamers, according to Ponder et al. in J. Mol. Biol. (1987) 193:775-791.
The attractive aspect of rotamers is that the theoretically continuous conformational
space of amino acid residues can be transformed into a discrete number of
representative states on the basis of a rotamer library (for instance the rotamer library
described by De Maeyer et al. in Fold. & Des. (1997) 2:53-66, comprising 859 elements
for the 17 amino acids having rotatable side chains), thus considerably facilitating
a thorough, systematical analysis of a protein structure. Furthermore, some energetic
properties of individual rotamers within the context of a given protein structure can be
pre-calculated before any modeling action takes place. The same can be done for
pairwise interaction energies between rotamers of different residues. On the basis of
individual (or "self") and mutual (or "pair") energies, a large number of rotamers and
rotamer combinations can be identified as being incompatible with the scaffold of the
protein of interest. Another major advantage of the rotamer concept is that search
methods are known in the art to find the energetically best possible global structure for
the side chains of a protein, given a fixed main chain, a rotamer library and an energy
function. Examples of such methods known in the art are the various embodiments of
the DEE method such as disclosed by Desmet et al. in Nature (1992) 356:539-542,
Lasters et al. in Protein Eng. (1993) 6:717-722, Desmet et al. in The Protein Folding
Problem and Tertiary Structure Prediction (1994) 307-337, Goldstein in Biophys. Jour.
(1994) 66:1335-1340 and De Maeyer et al. (cited supra). Irrespective of the search
method, the use of rotamers makes it possible to overcome local energetic barriers by simply
switching between different rotameric states.
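
As an illustration of how self and pair energies can rule out rotamers, the sketch below applies one pass of the dead-end elimination criterion in its commonly stated original form: rotamer r at position i is discarded if its best-case energy exceeds the worst-case energy of some alternative rotamer t at the same position. The table layout, function name and toy energies are assumptions for illustration and do not reproduce any of the cited implementations.

```python
def dee_pass(rotamers, self_e, pair_e):
    """One elimination pass.

    rotamers: {position: [rotamer ids]}
    self_e:   {(i, r): energy of rotamer r at position i with the template}
    pair_e:   {(i, r, j, s): interaction energy, stored once with i < j}
    """
    def pair(i, r, j, s):
        return pair_e[(i, r, j, s)] if i < j else pair_e[(j, s, i, r)]

    kept = {}
    for i, rots in rotamers.items():
        others = [j for j in rotamers if j != i]
        # Lower bound: each neighbor adopts the rotamer most favorable to r.
        low = {r: self_e[(i, r)]
               + sum(min(pair(i, r, j, s) for s in rotamers[j]) for j in others)
               for r in rots}
        # Upper bound: each neighbor adopts the rotamer least favorable to r.
        high = {r: self_e[(i, r)]
                + sum(max(pair(i, r, j, s) for s in rotamers[j]) for j in others)
                for r in rots}
        kept[i] = [r for r in rots
                   if not any(low[r] > high[t] for t in rots if t != r)]
    return kept


# Toy system: two positions, two rotamers each (invented energies).
rotamers = {0: ["a", "b"], 1: ["x", "y"]}
self_e = {(0, "a"): 0.0, (0, "b"): 10.0, (1, "x"): 0.0, (1, "y"): 1.0}
pair_e = {(0, "a", 1, "x"): 1.0, (0, "a", 1, "y"): -1.0,
          (0, "b", 1, "x"): 0.0, (0, "b", 1, "y"): 2.0}

kept = dee_pass(rotamers, self_e, pair_e)
```

Here rotamer "b" at position 0 is eliminated because even its best case (10.0) is worse than the worst case of "a" (1.0); in practice the pass would be repeated until no further rotamer can be removed.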

Furthermore, it should be noted that ECO's refer to the structural compatibility of 
"patterns" that are grafted into a given protein structure. As used herein, a pattern 
consists, at the choice of the user, of one or more residue positions, each carrying an
amino acid type in a certain conformation. Pattern residues need not be covalently 
linked to each other, but they must be intact (at least the whole side chain must be 
considered) and their 3D-structure must be unambiguous. In the preferred embodiment 
of the present invention, different ECO's may be generated, where each ECO is an 
energetic compatibility object defined on one or more residue positions, each having
assigned a specific amino acid type and adopting a well-defined rotameric state. 

The selection of rotamers occurs on the basis of a rotamer library, for instance
according to one of the three following possibilities: (i) the retrieval of side-chain
rotamers from a backbone-independent library, (ii) the retrieval of side-chain rotamers
from a backbone-dependent library and (iii) the retrieval of combined main-chain/side-chain
rotamers from a backbone-dependent library. The two former possibilities
necessarily imply that the main-chain structure of a pattern residue is taken from the
protein starting structure. The latter possibility requires an additional comparison
between the local backbone conformation in the protein structure and all backbone
conformations in the library, followed by the retrieval of a side-chain rotamer associated with the
closest matching backbone conformation. It is therefore more complicated, since it
involves the insertion of the main-chain portion of the rotamer selected from the library
into the backbone of the protein structure. This can be performed in many ways. For
instance, for isolated pattern residues, it is possible to apply a least-squares fitting of the
peptide planes determined by the main-chain dihedral angles of the selected library
rotamer onto those of the protein structure where the rotamer is to be inserted. For
consecutive pattern residues (e.g. loops), the method described in Brugel (Delhaise et al.
in J. Mol. Graph. (1984) 2:103-106) can be used. In both cases, the method may be
followed by an energy refinement step using a gradient method, e.g. a steepest descent
energy minimization. A common feature of the three possible selection methods is that
they result in a unique definition of the 3D structure of each pattern residue: their
side-chain conformations are selected from a rotamer library whereas their main-chain
conformations are taken either from a library or from the protein starting structure.

Another aspect of the present invention concerns the conformational freedom of
that part of the protein not belonging to the pattern of interest, i.e. the "pattern
environment". In this respect, the present method includes a step wherein the pattern
environment is conformationally adapted to the pattern itself and which essentially aims
at an energetic optimization of intramolecular interactions within the pattern
environment and between the latter and the pattern atoms. Thus, we shall make a
distinction between a rotameric and a non-rotameric embodiment, both subject of the
present invention, since they have quite different characteristics. In the preferred
rotameric embodiment, the accessible conformational freedom of the pattern
environment is determined by a rotamer library. Similarly to pattern residues, the
rotamer library used for the non-pattern residues may be either a side-chain or a
combined main-chain/side-chain rotamer library. However, restricting the
conformational space of the non-pattern residues to only side-chain rotamers (thus
keeping fixed the main chain) is highly preferred because of its mathematical simplicity.

In contrast to the assignment of one unique rotameric conformation to each of the
pattern residues, the residues of the pattern environment have in principle access to all
rotameric states known in the library for the associated residue types. The total
conformational space is then defined as all possible combinations of rotamers for the
non-pattern residues. The search method described in the section STRUCTURE
OPTIMIZATION below then needs to find the energetically most favorable
combination.

As will be appreciated by those skilled in the art, several techniques can be applied
to restrict the number of rotamers for individual residues in a protein structure, based on
criteria related to local secondary structure, local energy, solvent accessibility and so on.
In addition, some residues of the pattern environment may be completely excluded from
adaptation (this is equivalent to including such residues in the template, see
STRUCTURE DEFINITION) in order to speed up computations. This however may
lead to an underestimation of the structural compatibility of the pattern of interest. The user
therefore has to decide whether the induced level of error remains acceptable in the
application concerned. For example in a fold recognition application, where the
compatibility of a long amino acid sequence with different protein folds is to be
assessed, it is acceptable that many single residues are erroneously predicted as
incompatible, provided that the
global score for a correct or "true" sequence-structure correlation remains
distinguishable from the incorrect or "false" matches. It may even be preferred to
generate ECO's while limiting adaptation to only those residues in immediate contact
with the pattern residue(s). Conversely, should a user be interested in testing whether a
given pattern (e.g. the binding site of a receptor, or any part thereof) can be designed
into a scaffold of interest, then it is advisable to eliminate as many sources of error as
possible and therefore to retain a high number of adaptive residues around the said
pattern. As a general rule, it is suggested to apply one of many possible threshold-based
methods directly or indirectly related to the distance between pattern and environment
residues. For example, a convenient criterion may be to include in the set of adaptive
residues all residues j for which

$$\min_i D(i,j) < S + n(t_i) + n(t_j),$$

where i is a pattern residue, j is an environment residue, $D(i,j)$ is preferably equal to the
distance between the Cβ-atoms of i and j in the protein structure, $\min_i D(i,j)$ is the
closest such distance for all pattern residues i, S is a predefined threshold value
(preferably about 6 Å, but possibly lower or higher values for less or more critical
applications, respectively), $t_i$ and $t_j$ are the residue types at positions i and j,
respectively, and $n(t_i)$ and $n(t_j)$ are measures of the size of the side chains at the
respective positions i and j, where n is preferably equal to the number of heavy
atoms (i.e. all non-hydrogen atoms) counted from the Cβ-atom (included) along the
longest branch in the side chain. For example, if $t_i$ = Ile, $t_j$ = Phe and S = 6 Å, j would be
included in the set of adaptive residues if $D(i,j)$ < 6 + 3 + 5 = 14 Å.
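
The distance criterion above lends itself to a short sketch. The function and variable names are hypothetical, the size table contains only the two values quoted in the worked example (Ile = 3, Phe = 5), and, since the text does not say which pattern residue supplies the size term when the pattern has several residues, the sketch assumes that an environment residue is selected as soon as the inequality holds for at least one pattern residue.

```python
import math

# Side-chain size n: heavy atoms counted from C-beta (included) along the
# longest branch. Only the two values given in the text are listed here.
SIDE_CHAIN_SIZE = {"ILE": 3, "PHE": 5}


def adaptive_residues(pattern, environment, threshold=6.0):
    """Return ids of environment residues allowed to adapt.

    pattern, environment: {residue id: (residue type, C-beta xyz in Angstrom)}
    threshold: the value S of the criterion (about 6 Angstrom is preferred).
    """
    selected = []
    for j, (type_j, xyz_j) in environment.items():
        for type_i, xyz_i in pattern.values():
            cutoff = threshold + SIDE_CHAIN_SIZE[type_i] + SIDE_CHAIN_SIZE[type_j]
            if math.dist(xyz_i, xyz_j) < cutoff:  # strict inequality, as in the text
                selected.append(j)
                break
    return selected


pattern = {1: ("ILE", (0.0, 0.0, 0.0))}
environment = {2: ("PHE", (13.9, 0.0, 0.0)), 3: ("PHE", (14.1, 0.0, 0.0))}
adaptive = adaptive_residues(pattern, environment)
```

With the Ile/Phe pair of the worked example the cutoff is 14 Å, so the residue at 13.9 Å is adaptive and the one at 14.1 Å is not.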

Although less preferred, the present method to generate ECO's can also be
performed without using the rotamer concept. Again a distinction needs to be made
between the pattern residues and the pattern environment. With respect to the
conformation of the one or more pattern residues, a first approach is to "copy" a
sub-structure from a known 3D-structure and to graft it into the structure of the protein of
interest using a suitable structure fitting method. This can be of interest, for example, in
loop grafting or when attempting to substitute binding site patterns. This fully matches
the ECO concept since the definition of a pattern requires establishing a well-defined
set of residue positions (e.g. a loop fragment), carrying a well-defined amino acid
sequence (one residue type at each position) in a well-defined conformation (e.g. a
known conformation). An alternative way to assign a conformation to pattern residues is to
randomly sample it from a continuous torsion space around main-chain and/or
side-chain dihedral angles using a random number generator. This approach does form a
feasible alternative to investigate the compatibility of different patterns occupying the
same residue positions and types while adopting different non-rotameric conformations.

The non-pattern residues form the subject of the second step of the present method,
i.e. the adaptation step. Here also, a less preferred but feasible non-rotameric approach
is possible. The structural adaptation of the pattern environment in a continuous
conformational space will be driven by the selected optimization method, typically a
gradient method, in combination with the selected energy function, typically a potential
or free energy function. Here, the accessible space is in principle nearly infinite since it
includes all possible combinations of continuous torsion variations with bond angle
bending, bond stretching, and so on. However, in practice, most continuous
optimization methods (such as e.g. steepest descent energy minimization) do cause only
local changes in the structure and are prone to get trapped into local optima instead of
the global optimum. On the other hand, gradual non-rotameric optimization methods
advantageously provide a simple solution to the problem of main-chain flexibility
because during structural optimization of a pattern environment, the protein main-chain
can be optimized together with the side chains.

In conclusion, both a rotameric and non-rotameric method are suitable for both
pattern definition and adaptation of the pattern environment. Although all combinations
are possible and useful in particular circumstances, particularly preferred are the
combination of rotameric patterns with non-rotameric adaptation as well as
combinations of rotameric adaptation with rotameric or non-rotameric patterns of
interest.
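
The contrast between continuous and rotameric adaptation discussed in this section can be made concrete with a one-dimensional toy example. The torsional energy below is invented purely for illustration (it is not a force-field term); it has three wells, near 60°, 180° and 300°, of which the one at 180° is deepest. A fixed-step steepest descent started at 50° settles in the nearest well, whereas evaluating three assumed rotamer states finds the deepest one.

```python
import math


def energy(chi):
    """Toy torsional energy in arbitrary units; chi in radians."""
    return 1.0 + math.cos(3 * chi) + 0.5 * math.cos(chi)


def gradient(chi):
    """Analytical derivative of the toy energy with respect to chi."""
    return -3.0 * math.sin(3 * chi) - 0.5 * math.sin(chi)


def steepest_descent(chi, step=0.01, iterations=5000):
    """Fixed-step steepest descent; converges to the nearest local minimum."""
    for _ in range(iterations):
        chi -= step * gradient(chi)
    return chi


# Continuous adaptation: trapped in the well next to the starting point.
chi_local = steepest_descent(math.radians(50.0))

# Rotameric adaptation: evaluate each discrete state and keep the best.
rotamers = [math.radians(angle) for angle in (60.0, 180.0, 300.0)]
chi_rotamer = min(rotamers, key=energy)
```

The descent cannot cross the barrier near 120°, while switching between discrete states ignores the barrier altogether, which is the property attributed to rotamers earlier in this section.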



ENERGY FUNCTION 

The most important element of an ECO is its energy value, which should be
interpretable as a quantitative measure of the structural compatibility of the ECO pattern
within the context of the adapted protein structure. For this reason, the notion of 
"structural compatibility" may be considered equivalent to "energetic compatibility". 

In principle, the method of the present invention may be performed with all energy 
functions commonly used in the field of protein structure prediction, provided that the 
applied energy model allows an energetic value to be assigned that is a pure function of the
atomic coordinates, atom types and chemical bonds in a protein structure, or any part 
thereof. When considering rotamers it is preferred, although not required, that the 
energy function be pair-wise decomposable, i.e. the total energy of a protein structure 
can be written as a sum of single and pair-residue contributions (see equation 4 below), 
in order to facilitate the adaptation step in the generation of ECO's.

Preferably, the applied energy model should approximate as closely as possible the
true thermodynamic free energy of a molecular structure. Therefore, the applied energy
function should include as many relevant energetic contributions as possible, depending
on the accuracy requirements of the user. Highly preferred energy functions include
terms which account for at least Van der Waals interactions (including a Lennard-Jones
potential such as disclosed by Mayo et al. in J. Prot. Chem. (1990)), hydrogen bond
formation (such as referred by Stickle et al. in J. Mol. Biol. (1992) 226:1143),
electrostatic interactions and contributions related to chemical bonds (bond stretching,
angle bending, torsions, planarity deviations), i.e. the conventional potential energy
terms. Useful energy models including such contributions are, for example, the
CHARMM force field of Brooks et al. (cited supra) or the AMBER force field
disclosed by Weiner et al. in J. Am. Chem. Soc. (1984) 106:765.

In addition to these potential energy terms, some other energetic contributions may
be important, given the fact that different ECO's may involve different residue types
of varying size and chemical nature, placed at topologically different positions in a
protein structure. Four types of energetic contributions and corrections may account for
type- and topology-dependent effects: corrections for (1) main-chain flexibility, (2)
solvation effects, (3) intra side-chain contributions and (4) the behavior of particular
residue types in the unfolded state.

Main-chain flexibility, or adaptability of the backbone to an inserted pattern, can 
be corrected by reducing the sensitivity of the terms related to packing effects, e.g. 
using methods including united-atom parameters such as disclosed by Lazar et al. in
Protein Sci. (1997) 6:1167-1178 or re-scaling of atomic radii such as proposed by
Dahiyat et al. in Proc. Natl. Acad. Sci. USA (1997) 94:10172-10177. When generating
ECO's using a gradient method in the adaptation step, main-chain variability is 
automatically taken into account by including the corresponding atoms in the 
optimization process. 

Because of the hydrophobic effect, solvation effects may be a crucial factor
determining the reliability of modeled protein structures. According to the present
invention, computing explicit protein-solvent interactions is not preferred since this is
prohibitively slow. Instead, a compromise between computational performance and
prediction accuracy is a method wherein the burial and exposure of polar and non-polar
atoms is translated into energetic correction terms for solvation on the basis of their
buried and exposed surface areas and of a set of atom type-specific solvation
parameters, such as reviewed by Gordon et al. in Curr. Opin. Struct. Biol. (1999)
9:509-513. In the embodiment of the present invention wherein the adaptation step in the ECO
generation occurs in a rotameric way, it is beneficial to have a pair-wise decomposable
solvent function, such as proposed by Street et al. (cited supra). Also preferred for its
computational efficiency is a method for the assignment of a set of energetic solvation
terms to each of the considered residue types (usually the 20 naturally-occurring amino
acids) depending on the degree of solvent exposure of their respective rotamers at the
considered residue positions in the protein structure of interest. This method is hereafter
referred to as the type-dependent, topology-specific solvation (TTS) method. In this
method, a set of N_TYPES energetic parameters is established for each of N_CLASSES
different classes of solvent exposure, where N_TYPES is the number of residue types
considered (usually 20) and N_CLASSES is the number of discrete classes of solvent
exposure. For example, three such classes may be established, corresponding with
buried, semi-buried and solvent-exposed rotamers, respectively. It is noteworthy that
different rotamers for the same residue type at the same position may be assigned
different classes. Both the number of classes and their boundaries may be subject to
variation, but the following embodiment should be preferred:

- first, each pattern element (a given residue type at a given residue position in a
specific rotameric state) is substituted into the protein structure and its accessible
surface area (ASA) is calculated, for instance according to Chothia in Nature (1974)
248:338-339,

- next, a class assignment occurs on the basis of the percentage ASA (ASA%) of the
residue side chain in the protein structure compared to the maximal ASA (over all
rotamers) of the same side chain being shielded from the solvent only by its own
main-chain atoms; for instance, the class of buried rotamers may be defined by
ASA% < 5%, the class of semi-buried rotamers by 5% ≤ ASA% < 25% and the class of
solvent-exposed rotamers by 25% ≤ ASA%,

- and finally an appropriate type- and topology-specific energy, $E_{TTS}(T,C)$, for the
considered pattern element is retrieved from a table of TTS values, using the pattern
type (T) and class (C) indices.

The TTS energy thus obtained should be added as an extra term to the energy of the
pattern element in the protein structure (see FORMAL DESCRIPTION OF ECO'S).
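
The three steps above may be illustrated by the following minimal Python sketch. It is
not part of the disclosed implementation: the function names, the table layout
(a dictionary of per-type lists indexed by class) and the handling of a side chain with
zero maximal ASA are assumptions made for illustration, and the class boundaries are the
example values of 5% and 25% given above.

```python
# Hypothetical class indices for the three example exposure classes.
BURIED, SEMI_BURIED, EXPOSED = 0, 1, 2


def exposure_class(asa, max_asa, buried_cut=5.0, exposed_cut=25.0):
    """Assign a solvent-exposure class from the percentage ASA of a side chain.

    asa     : side-chain ASA of the rotamer within the protein structure
    max_asa : maximal side-chain ASA (over all rotamers) when shielded only
              by the residue's own main-chain atoms
    """
    if max_asa <= 0.0:
        # Assumption: a residue without side-chain surface is treated as buried.
        return BURIED
    pct = 100.0 * asa / max_asa
    if pct < buried_cut:
        return BURIED
    if pct < exposed_cut:
        return SEMI_BURIED
    return EXPOSED


def tts_energy(tts_table, residue_type, asa, max_asa):
    """Retrieve E_TTS(T, C) from a table indexed by type and class."""
    return tts_table[residue_type][exposure_class(asa, max_asa)]


def corrected_energy(e_pot, tts_table, residue_type, asa, max_asa):
    """Add the TTS term to the potential energy of a pattern element."""
    return e_pot + tts_energy(tts_table, residue_type, asa, max_asa)
```

For example, with a purely illustrative table entry `{"LEU": [-1.0, -0.5, 0.25]}`, a
leucine rotamer exposing 10% of its maximal side-chain ASA falls in the semi-buried
class and receives the correction -0.5.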

A table of TTS values may be established by tuning the different TTS parameters
$E_{TTS}(T,C)$ versus a heuristic scoring function $S$ as follows. The tuning method
includes:

- first considering a set of a number of, preferably at least 50, well-resolved protein
structures and assigning to each residue position of this set a unique index $i$
($1 \le i \le$ all residues),

- then considering, at each residue position $i$, all residue types $a$ in all rotameric
states $r$, thereby forming the set of all possible position-type-rotamer combinations
$i_r^a$,

- defining by $i_r^w$ the WT residue type at position $i$ in a rotameric state $r$ as
observed in the protein comprising position $i$,

- assigning to each $i_r^a$ a solvent exposure class $C(i_r^a)$ as described above,

- calculating for each $i_r^a$ the value $E_{pot}(i_r^a)$ as the total potential energy
the rotamer $i_r^a$ experiences within the fixed context of the protein structure
comprising $i$, preferably using conventional potential energy terms such as disclosed
above,

- defining a function $\Delta E(i)$ as

$$\Delta E(i) = \min_r \left( E_{pot}(i_r^w) + E_{TTS}(w, C(i_r^w)) \right) - \min_a \min_r \left( E_{pot}(i_r^a) + E_{TTS}(a, C(i_r^a)) \right) \qquad \text{(eq. 2)}$$

$\Delta E(i)$ being interpretable as the difference in energy of the WT residue type in
its best possible rotamer (as obtained by the $\min_r$ operator) and the energetically
most favorable residue type at residue position $i$ also in its best possible rotamer
(as obtained by both the $\min_a$ operator and the $\min_r$ operator), given a (as yet
undefined) set of TTS parameters, and

- applying a suitable parameter optimization method to maximize $S$ by varying the
$E_{TTS}(T,C)$ parameters.

If the TTS parameters are zero, then equation (2) provides the difference in
potential energy between the optimal WT rotamer and the energetically most favorable
residue type at position $i$ in its optimal rotameric state, within the context of the
protein structure involved. The effect of assigning non-zero values to the TTS parameters
is that the energy of a residue will be modified by its associated type- and
topology-dependent correction term. In the ideal case, the TTS parameters may be such
that, at each position $i$, the observed WT residue becomes or approximates the
energetically optimal residue type. In order to reach this objective, the objective
scoring function $S$ defined by the following equation (3)

$$S = \sum_i \left( 1 + \Delta E(i) \right)^{-1} \qquad \text{(eq. 3)}$$

may be used, which means that if at position $i$ the WT residue type is best (i.e.
$\Delta E(i) = 0$), then $S$ is increased by one unit. For instance, if the WT would be
1 kcal mol⁻¹ higher in energy than the best possible type, this would result in an
increase of $S$ by 0.5 unit.
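
Equations (2) and (3) may be evaluated as in the following Python sketch, which is given
only as an illustration. The nested-dictionary layout of the inputs and the function
names are assumptions; the sketch further assumes that the WT residue type is itself
among the residue types considered at the position, so that ΔE(i) cannot be negative.

```python
def delta_e(e_pot, exposure, tts, wt_type):
    """Evaluate eq. 2 at a single residue position.

    e_pot[a][r]    : potential energy of residue type a in rotamer r
    exposure[a][r] : solvent-exposure class index of that rotamer
    tts[a][c]      : TTS parameter for residue type a and class c
    wt_type        : residue type observed at this position
    """
    def best_rotamer_energy(a):
        # min over r of E_pot + E_TTS for one residue type
        return min(e + tts[a][exposure[a][r]] for r, e in enumerate(e_pot[a]))

    best_overall = min(best_rotamer_energy(a) for a in e_pot)
    return best_rotamer_energy(wt_type) - best_overall


def score(delta_es):
    """Evaluate eq. 3: each position contributes 1 / (1 + delta_E)."""
    return sum(1.0 / (1.0 + d) for d in delta_es)
```

With all TTS parameters at zero, a position whose WT type lies 1 kcal mol⁻¹ above the
best type yields ΔE = 1 and contributes 0.5 to S, in agreement with the example above.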

Finally, it has been observed that the function S as defined by equation (3) on 60
TTS parameters (20 types x 3 classes) is easy to optimize by simply taking steps above
and below the current value of each parameter and accepting only up-hill changes (i.e.
better values) in S. Initial parameters may be set to zero and initial steps are
preferably about 10 kcal mol⁻¹, gradually reducing in size by about 75% per optimization
round until a step size of about 0.05 kcal mol⁻¹ is reached, which terminates the
parameter optimization process.
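
The optimization scheme just described may be sketched in Python as follows. This is an
illustrative reading of the text rather than the disclosed implementation: the function
name and arguments are hypothetical, "reducing in size by about 75%" is interpreted as
multiplying the step by 0.25 after each round, and a single pass over all parameters per
round is assumed.

```python
def tune_parameters(score_fn, n_params, first_step=10.0, shrink=0.25, last_step=0.05):
    """Maximize score_fn by stepping each parameter up and down in turn.

    score_fn   : callable taking a list of parameters and returning S
    n_params   : number of parameters (e.g. 60 for 20 types x 3 classes)
    first_step : initial step size
    shrink     : factor applied to the step after each round (assumed 0.25)
    last_step  : the search stops once the step falls below this value
    """
    params = [0.0] * n_params          # initial parameters set to zero
    best = score_fn(params)
    step = first_step
    while step >= last_step:
        for k in range(n_params):
            for delta in (step, -step):
                trial = list(params)
                trial[k] += delta
                s = score_fn(trial)
                if s > best:           # accept only up-hill changes in S
                    params, best = trial, s
                    break
        step *= shrink
    return params, best
```

In an actual tuning run, `score_fn` would recompute S of equation (3) over all residue
positions of the training structures for each trial parameter set.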

The same strategy of parameter tuning may lead to even more relevant results if the
potential energy calculations, $E_{pot}(i_r^a)$, are performed within the context of an
adaptive protein structure rather than a fixed structure. This can be accomplished e.g.
by using one of the structure optimization methods described in the next section.
Moreover, such an approach is fully consistent with the ECO concept itself, wherein the
compatibility of patterns is evaluated in an adaptive environment. A set of TTS
contributions derived in this way is shown in table 3.

A most attractive feature of the TTS method is that terms used therein will
automatically account for two other effects which may be critical in the field of protein
design. The first effect is the mutual bias of different residue types when comparing
their intra-side-chain energies. For example, an arginine residue may have an energetic
advantage over a chemically similar lysine residue by more than 5 kcal mol⁻¹ (using the
CHARMM force field) which is solely due to intra-side-chain contributions. In order to
correct for this phenomenon in the prior art, such contributions may be simply switched
off or corrected for by subtracting a reference energy calculated as the potential energy
of an isolated amino acid residue, according to Wodak (cited supra). However, the
present invention prefers not to impose any a priori intra-side-chain correction, but to
incorporate this into the TTS contributions. The second effect relates to energetic
aspects of the unfolded state, which is ignored when computing energies exclusively
from the folded state of the protein. This involves complicated elements like changes in
entropy, residual secondary and tertiary structure in the unfolded state, the existence of
transient local and global states, and so on. Consequently, in a modeling environment
where native, folded structures are examined, it is extremely difficult to predict the
impact of such effects on the predictions. However, in as far as such effects would be
type- and topology-dependent, it may be assumed that the TTS method implicitly takes
them into account. For example, buried leucine residues may experience a greater loss
in torsional entropy compared to isoleucine upon folding of the protein in view of the
fact that the latter is beta-branched.

In conclusion, the present method for generating ECO's requires a suitable energy
function including as many relevant contributions as possible, preferably functions
based on inter-atomic distances and bond properties. There is no specific preference
with respect to explicit treatment of hydrogen atoms or not. The implementation of the
aforementioned TTS method will be useful when ECO's are prepared for fold
recognition or protein design, as well as for other applications.

STRUCTURE OPTIMIZATION

When generating ECO's, a practical and reliable method for structure optimization
is needed, i.e. a search method producing an optimal or nearly optimal 3D structure for
a protein molecule, given the starting structure of the protein as previously described in
STRUCTURE DEFINITION, the introduction of one or more substitutions according to
the pattern of interest (except for the reference structure described in the section
REFERENCE STRUCTURE below), the allowed conformational variation of the
protein system as previously described in CONFORMATIONAL FREEDOM and the
applied energetic model as described in ENERGY FUNCTION.

ECO's are intended to measure the change in total energy when non-native patterns
are introduced in the context of a protein structure, i.e. an ECO energy is actually a
differential value between the total energy of a substituted protein structure and that of a
native or "reference" protein structure. Both the substituted and the reference structures
should be modeled in a similar and reliable way in order to allow for interpretation of an
ECO energy as a measure of the true substitution energy associated with the pattern of
interest. Therefore the method should guarantee that both the substituted and the
non-substituted protein structures approximate as closely as possible the structures as
may be determined experimentally.

Assuming a nearly constant backbone conformation may be a source of error for
the most constrained substitutions: proteins comprising complicated substitutions may
reorganize their backbone conformation or may even completely unfold. In such cases,
ECO substitution energies predicted according to the present invention may be in error,
but should not change the conclusion that such substitutions are probably incompatible
with the protein backbone.

The present invention has no preference with respect to the structure optimization
(or search) method used as long as it provides a beneficial ratio of prediction
accuracy versus computational speed. Suitable known search methods include, for
example, Monte Carlo simulation such as disclosed by Holm et al. in Proteins
(1992) 14:213-223, genetic algorithms such as disclosed by Tuffery et al. in J.
Comput. Chem. (1993) 14:790-798, mean-field approaches such as disclosed by
Koehl et al. in J. Mol. Biol. (1994) 239:249-275, restricted combinatorial analysis
such as disclosed by Dunbrack et al. in J. Mol. Biol. (1993) 230:543-574, basic
molecular mechanics gradient methods such as described in Stoer et al.
in Introduction to Numerical Analysis (1980), Springer Verlag, New York, and DEE
methods such as previously disclosed. In the preferred embodiment of the present
invention wherein the conformational freedom of individual residues is represented
by rotamers, a hitherto unknown structure optimization method is however preferred
over the previously known methods because it combines high accuracy with high
intrinsic computational efficiency. This combination of characteristics is
important especially in the systematical pattern analysis (SPA) embodiment of this
invention wherein typically hundreds of thousands of patterns per protein need to
be analyzed. This method, hereinafter referred to as the FASTER method, is a method
for generating information related to the molecular structure of a biomolecule (e.g.
a protein), the method being executable by a computer under the control of a
program stored in the computer, and comprising the steps of:

(a) receiving a 3D representation of the molecular structure of said biomolecule, 
the said representation comprising a first set of residue portions and a template; 

(b) modifying the representation of step (a) by at least one optimization cycle; 
characterized in that each optimization cycle comprises the steps of: 

(b1) perturbing a first representation of the molecular structure by modifying
the structure of one or more of the first set of residue portions;
(b2) relaxing the perturbed representation by optimizing the structure of one
or more of the non-perturbed residue portions of the first set with respect
to the one or more perturbed residue portions;
(b3) evaluating the perturbed and relaxed representation of the
molecular structure by using an energetic cost function and
replacing the first representation by the perturbed and relaxed
representation if the latter's global energy is more optimal than that
of the first representation; and

(c) terminating the optimization process according to step (b) when a 
predetermined termination criterion is reached; and 

(d) outputting to a storage medium or to a consecutive method a data structure 
comprising information extracted from step (b). 
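
The perturbation, relaxation and evaluation cycle of steps (b1) to (b3) may be pictured
by the following toy Python sketch. It should not be read as the FASTER method itself:
the single-residue perturbation, the one-pass greedy relaxation, the termination
criterion (no improvement in a full round, or a maximum number of rounds) and all names
are simplifying assumptions chosen for illustration.

```python
def total_energy(conf, self_e, pair_e):
    """Global energy of a conformation; conf[i] is the rotamer index of residue i."""
    n = len(conf)
    e = sum(self_e[i][conf[i]] for i in range(n))
    e += sum(pair_e(i, conf[i], j, conf[j]) for i in range(n) for j in range(i + 1, n))
    return e


def local_energy(conf, j, s, self_e, pair_e):
    """Energy of residue j in rotamer s within the current environment."""
    return self_e[j][s] + sum(
        pair_e(j, s, k, conf[k]) for k in range(len(conf)) if k != j
    )


def perturb_relax_optimize(conf, self_e, pair_e, max_rounds=20):
    best = list(conf)
    best_e = total_energy(best, self_e, pair_e)
    for _ in range(max_rounds):                    # (c) termination criterion
        improved = False
        for i in range(len(best)):
            for r in range(len(self_e[i])):
                if r == best[i]:
                    continue
                trial = list(best)
                trial[i] = r                       # (b1) perturb residue i
                for j in range(len(trial)):        # (b2) relax the other residues
                    if j != i:
                        trial[j] = min(
                            range(len(self_e[j])),
                            key=lambda s: local_energy(trial, j, s, self_e, pair_e),
                        )
                e = total_energy(trial, self_e, pair_e)
                if e < best_e:                     # (b3) keep only improvements
                    best, best_e, improved = trial, e, True
        if not improved:
            break
    return best, best_e                            # (d) output the result
```

Here `self_e[i][r]` stands for the interaction of rotamer r at residue i with the fixed
template and `pair_e` for a pair-wise side-chain interaction; both are placeholders for
whatever energy function is actually applied.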

In short, this method works by repeated, gradual optimization of rotameric sets of
residues. When the protein system is near its globally optimal structure, which is the
case if small patterns (e.g. one or two residues) are substituted into a reference
structure, only a small number of optimization rounds are needed to obtain the optimally
relaxed structure. This process typically takes in the order of about 10 to 100
milliseconds CPU time per pattern analysis (using a 300 MHz R12000 processor as a
reference). The computation of the reference structure itself typically requires about 100
to 1000 CPU-seconds per protein structure.

REFERENCE STRUCTURE 

The reference structure is defined as the structure of lowest energy for the protein
of interest, given an experimentally determined or computationally generated starting
structure, a particular rotamer library (only in the rotameric embodiment of the present
invention) and a selected energy function, and may be obtained by a suitable search
method (see STRUCTURE OPTIMIZATION).

The role of the reference structure is to make an ECO energy (the value computed
in step F) interpretable as a substitution energy. ECO's are generated by substituting a
non-native pattern into an optimally relaxed reference structure, reconsidering (in the
structure adaptation step) the optimal packing arrangement and computing the
difference in global energy between the modified and the original reference structure.

An important feature of the reference structure is its amino acid sequence. For
instance, the protein of interest may be modified prior to computing its structure of
lowest energy by supplying the backbone with new side chains. Thus, the user of the
method may define a reference sequence which is different from the sequence observed
in the starting structure of the protein. Either a few amino acid side chains or the
complete amino acid sequence may be altered. The reference structure is then the
conformation of lowest energy for the protein of interest having a user-selected
reference amino acid sequence. The most obvious reference sequence is the native amino
acid sequence of the protein starting structure, but valuable alternatives thereto will be
explained hereafter.

The reference sequence should preferably approximate as closely as possible the
sequences that will be generated in an ECO-based application. For example, when
interest is in thermally stabilizing mutants of a given enzyme of known sequence but
unknown structure, and assuming that only a distantly related homologous 3D-structure
would be available from a database of protein structures, then it is advisable to compute
the structure of lowest energy for the sequence of said enzyme (e.g. using the
homologous backbone structure) rather than for the sequence of the homologous protein
(using its own backbone structure). If the mutants of the said enzyme were built from
ECO's generated for the homologous structure, this would involve many more
individual changes (and potential sources of error) as compared to when the enzyme
mutants are built with ECO's generated with the modeled enzyme sequence as a
reference structure.

However, protein design and fold recognition applications may also rely on some
general purpose databases of ECO's derived from representative protein structures or
scaffolds. Furthermore, it is not an absolute requirement to use any known or existing
reference amino acid sequence when generating ECO's. A most interesting type of
reference sequence may be the sequence leading to the lowest possible energy for the
protein of interest, in other words the global minimum energy sequence which may be
computed by any suitable search method such as previously described (see
STRUCTURE OPTIMIZATION), e.g. a DEE method or the FASTER method. Some
other reference sequences, such as a polyglycine or a polyalanine sequence, may
provide practically useful reference structures. They provide the advantage that the
adaptation step in the ECO computation is easier, but the disadvantage that the ECO
patterns are modeled onto an almost naked backbone. Finally, we observed
experimentally that assessing multiply substituted proteins from single and double
ECO's is surprisingly insensitive to the nature of the reference sequence, especially in
applications which tolerate some level of approximation (like fold recognition).
Therefore any possible reference sequence, including randomly generated and dummy
sequences, may be practically useful in particular cases. Selecting a reference sequence
forms an essential part of generating an ECO since it directly relates to the interpretation
of the associated energy value as being a substitution energy.

DESCRIPTION OF THE ECO CONCEPT 

An "energetic compatibility object" (ECO) is an object which basically provides a
fully structured description of the results of a modeling experiment wherein a given
pattern p of interest has been introduced and modeled into a protein reference structure.
Usually the pattern p consists of a small number of amino acid residue types, each being
placed at a specific residue position in the reference structure and adopting a
well-defined fixed conformation. In a preferred embodiment of the present invention,
hereinafter referred to as the "systematical pattern analysis" (SPA) mode, a large number
of predefined patterns may be generated and analyzed within the context of the
reference structure. The SPA mode is particularly suited to obtain large-scale results
concerning the mutability of single residues and pairs of residues. It also provides the
necessary input data for the analysis of multiply substituted proteins on the basis of
elementary data components.

The number of residues of p is usually 1 or 2 but larger patterns of interest,
comprising e.g. a set of clustered core residues, may be considered as well. In general,
the present invention does not impose restrictions on the number of residues included in
p since by definition p is a pattern of interest to the user. However, in the preferred SPA
mode, where ECO's are systematically generated for many different patterns, the
number of residues of any such pattern may preferably be limited to 1 or 2.

Besides the number of residues, an unambiguous description of p also requires
establishing the exact locations of the one or more residues of p within the framework
of the reference structure. In the present invention the allowed locations correspond
with amino acid residue positions, which means that the said locations are defined by
selecting, for each pattern residue, a specific amino acid residue position of interest. All
conventional amino acid residues which are present in the protein reference structure are
valid positions, irrespective of whether they are chemically modified or cross-linked to
other residues. Conversely, atoms or molecules not belonging to a protein or peptide,
known in the art as ligands, are not considered as valid positions. In the case of patterns
consisting of more than one residue, some further rules apply as follows. Any two
residues of the same pattern p should not occupy the same position. Also, as will be
appreciated by those in the art, it may be desirable to consider only patterns for which
the residue positions are proximate in space since very distant residues are likely to be
independent. However, this is preferred only when (i) a large number of different
patterns has to be examined and computational speed is a limiting factor, and (ii) the
condition of independence and the related criterion of proximity have been statistically
verified. In the preferred SPA mode of implementation of the invention, the residue
positions of interest may be determined by the region in the protein reference structure
which is of interest to the user. If the interest is e.g. in stabilizing a protein, only core
residues may be taken into account and a set of patterns may be systematically defined
for all single core residues as well as all possible pair-wise combinations thereof.

A further aspect of the present invention is the chemical nature of the one or more
residues of p which may be validly considered at the selected positions in the reference
structure. According to the invention, all naturally-occurring and synthetic amino acid
types may be included in the set of valid residue types. The choice of these residue
types is basically unrelated to the amino acid sequence of the reference structure. If a
residue type of p differs from the residue type in the reference structure at the selected
position, this actually corresponds to a mutation. In such case, the original residue type
needs to be removed from the reference structure and replaced by the pattern residue
type. Each replacement step is preferably done by respecting the rules of standard
geometry (standard bond lengths and angles and default torsion angles). In the preferred
SPA mode of implementation of the invention, the set of single residue patterns may
consist of all valid residue types as defined above, placed at each residue position of the
region of interest. The set of pair residue patterns may consist of all possible pair-wise
combinations thereof.
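
The systematic definition of single and pair residue patterns in the SPA mode may be
sketched as follows in Python. The function names, the representation of a pattern as a
tuple of (position, residue type) pairs and the optional proximity predicate are
assumptions made for illustration; the conformational (rotamer) assignment of each
pattern residue is left out of this sketch.

```python
from itertools import combinations, product


def single_patterns(positions, types_at):
    """All single-residue patterns: every valid type at every position of interest."""
    return [((pos, t),) for pos in positions for t in types_at(pos)]


def pair_patterns(positions, types_at, is_close=lambda p, q: True):
    """All pair patterns over distinct positions, optionally filtered by proximity."""
    patterns = []
    # combinations() never pairs a position with itself, so two residues of
    # the same pattern cannot occupy the same position.
    for p, q in combinations(positions, 2):
        if not is_close(p, q):
            continue
        for tp, tq in product(types_at(p), types_at(q)):
            patterns.append(((p, tp), (q, tq)))
    return patterns
```

With three positions of interest and two valid types per position, this yields 6 single
patterns and 12 pair patterns; a proximity predicate reduces the latter accordingly.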

Fully consistent with the present invention is any method used in step B in order to
further restrict the number of valid residue types at each position in the region of
interest. Such an embodiment may be used (1) when the available computational
resources necessitate a restriction, or (2) in cases which are interrelated with the
application of ECO's. Ignoring the first case (thus assuming infinite computational
resources), it will be clear to those skilled in the art that some residue types may be
discarded from consideration depending on the position in the reference structure.
Examples of such restriction methods include, without limitation:

- the exclusion of synthetic residue types,

- the exclusion of proline (actually an imino acid) residues depending on the local
main-chain conformation,

- the restriction of core residues to non-polar types,

- the exclusion of residue types with low helical propensity in alpha-helices,

- the preservation of the chemical nature (e.g. aromatic) of residue types in the
reference structure, and so on.

So far, a pattern p is defined as one or more residue types placed into the reference
structure at well defined residue positions. In order to unambiguously define p, the
conformation of each p-residue must be further specified by the assignment of one
specific conformation to each residue of p, preferably by selecting a specific rotamer
from a rotamer library as described hereinabove. Alternatively, in the non-rotameric
embodiment of the present invention, the said conformational assignment may result
from other sources including, but not limited to, experimentally determined protein
structures, computer-generated conformations or, as described below, a grouping
process applied to different ECO's. The present invention does not impose restrictions
on the nature and origin of the conformation of the p-residues. In the preferred SPA
mode of implementation of the present invention, all (combinations of) rotamers are
applied to the p-residues. For patterns consisting of residue pairs, this leads to a
six-dimensional space. In order to keep the "volume" of this hyper-space within reasonable
bounds, a variety of simple and more complicated criteria may be established in order to
reduce the number of patterns for residue pairs. However, such criteria are ignored in
the present invention since the result of any reduction method necessarily leads to a set
of patterns of interest, which is the subject of the present invention.

The most important property of an ECO is its energy value which is defined as the
difference between the global energy of the reference structure and the global energy of
the same reference structure after its modification by the steps of (1) introducing the
ECO pattern into the reference structure and (2) computing the globally optimal
conformational rearrangement of the reference structure as illustrated in Fig. 2. Steps (1)
and (2) will be referred to as the pattern introduction (PI) step and the structural
adaptation (SA) step respectively. Computing the ECO energy value is therefore
preferably completed by a third step called the energy calculation (EC) step wherein the
global energy of the substituted and adapted structure is calculated and further wherein
the global energy of the reference structure is subtracted from this value.

The PI step involves the necessary substitutions and the generation of the correct
conformation of the p-residues, as dictated by the precise definition of the pattern p.
This is preferably accomplished by a conventional protein modeling package such as
disclosed by Delhaise et al. (cited supra) comprising practical tools in order to perform
amino acid substitutions and conformational manipulations. Once the pattern p is
introduced, the amino acid type(s) and conformation(s) of the substituted residues are
frozen throughout the process leading to the ECO energy value computation. The PI
step can be seen as a specific perturbation of the reference structure, leaving this
structure in a sub-optimal state. Indeed, the introduced residues in their fixed
conformation may very well be in steric conflict with surrounding amino acids.

In the SA step, the latter are given a chance to rearrange and find a new relaxed
conformation in which steric constraints are alleviated and intramolecular interactions
are improved. According to the present invention, this rearrangement is energy-based,
i.e. the cost function to be optimized during the SA step is the energy of the global
structure as defined in the section ENERGY FUNCTION. In a preferred embodiment of
the invention, especially the rotameric embodiment described in the section
CONFORMATIONAL FREEDOM, all atoms of the pattern residues and the main-chain
of the reference structure are held fixed throughout the SA step. In a less preferred
embodiment, the pattern atoms remain fixed but the main-chain atoms may slightly (i.e.
by about 1 to 2 Å) change in position during the SA step. In another embodiment, and
only for the non-rotameric approach, the pattern atoms may also change in position
during the adaptation step; this embodiment is least preferred since the introduced
pattern then loses its original structure and accounts, at least partially, for the structural
relaxation effects.

The SA step itself may be performed by any method suitable for protein structure
prediction (see STRUCTURE OPTIMIZATION) on the basis of an objective energetic
scoring function (see ENERGY FUNCTION). The goal of the SA step is to find the
energetically best possible substituted and adapted structure. In this step, the substituted
reference structure is submitted to a structural optimization process exploring the
conformational space of the protein system, taking into account (i) the restraints
imposed onto the backbone and pattern atoms (which are preferably fixed), (ii) the
allowed conformational freedom (e.g. as determined by a rotamer library) and (iii) the
applied energy function. The specific procedure used in order to perform the said
structural optimization is not critical to the present invention. Methods which may be
used for this purpose are described in the section STRUCTURE OPTIMIZATION.

In the EC step, the ECO energy value is calculated as a differential value, i.e. the
energy of the substituted and adapted structure obtained from the SA step, minus the
energy of the reference structure. Both energies can be readily obtained by a standard
energy calculation method using the relevant energy function in order to compute all
relevant intra-molecular interactions and providing a quantitative estimate for the free or
potential energy of a protein structure. Therefore the ECO energy value obtained by
subtracting both energies is a quantitative measure for the energetic cost to mutate the
pattern p into the reference structure. As far as the computed energies correspond
with the true thermodynamic free energies of the original (WT) and mutated proteins,
they will reflect the change in thermodynamic stability of the mutated protein.
Consequently, the ECO energy value can also be considered as a fair estimate of the
structural compatibility of the pattern p within the framework of the protein of interest.
The ECO energy value can be either positive (in which case the p-residues are expected
to be less compatible with the protein than the WT residues) or negative (in which case
the corresponding pattern is expected to stabilize the protein molecule).
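
The sequence of PI, SA and EC steps may be summarized by the following Python sketch,
which is an illustration under stated assumptions rather than the disclosed
implementation. The structure is represented as a dictionary mapping a residue position
to a (residue type, rotamer index) pair, the energy function is supplied by the caller,
and the SA step is carried out by exhaustive enumeration, which is only feasible for
toy-sized systems; all names are hypothetical.

```python
from itertools import product


def eco_energy(reference, pattern, n_rotamers, energy):
    """Return the ECO energy value of `pattern` relative to `reference`.

    reference  : dict position -> (residue_type, rotamer), assumed optimally relaxed
    pattern    : dict position -> (residue_type, rotamer), frozen once introduced
    n_rotamers : number of rotamers tried for each non-pattern residue
    energy     : callable returning the global energy of a structure dict
    """
    # PI step: introduce the pattern residues in their prescribed conformation.
    substituted = dict(reference)
    substituted.update(pattern)

    # SA step: let all non-pattern residues rearrange; pattern residues stay fixed.
    free = [pos for pos in substituted if pos not in pattern]
    best = None
    for rotamers in product(range(n_rotamers), repeat=len(free)):
        trial = dict(substituted)
        for pos, rot in zip(free, rotamers):
            trial[pos] = (trial[pos][0], rot)
        e = energy(trial)
        if best is None or e < best:
            best = e

    # EC step: adapted-structure energy minus reference-structure energy.
    return best - energy(reference)
```

A positive return value indicates a pattern that is less compatible with the structure
than the reference residues, a negative value a pattern expected to be stabilizing.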

Apart from the ECO pattern and the associated energy value, an ECO may also
include other relevant information, including data referring to the origin of the ECO
such as (i) a reference (e.g. a PDB-code) to the (e.g. experimentally determined) 3D
structure of the protein used in the ECO generation, (ii) a reference to the rotamer
library used to model the reference structure and the adapted structure, (iii) a reference
to the applied energy function, (iv) a reference to the method used in the adaptation
step, (v) a description of the reference sequence and structure, and (vi) the
transformations (as discussed below) which may be applied to the native ECO. The
latter data are important for a correct interpretation of the ECO energy. Moreover, they
provide a large amount of added value:

- first, since an ECO is self-describing, it should be possible to verify or reconstruct
its energy value.

- secondly, the information related to energetic transformations provides the user with
full autonomy to restore or re-transform these data (e.g. to express the data in
normalized units) if needed in a particular application.

- thirdly, this information considerably facilitates all forms of data mining. For
example, it makes it easy to compare sets of ECO's constructed with different
rotamer libraries, homologous backbone structures, energy models, adaptation
methods or reference sequences.

In addition to the aforementioned data, other relevant though non-essential
information may be stored in the ECO object. Examples of such information include,
but are not limited to, (i) a description of the observed secondary structure for each
pattern position, (ii) atomic coordinates for some or all atoms of the pattern residues,
e.g. atomic coordinates for the Cα atoms, which may be helpful in identifying
neighboring/distant residues without loading a full 3D-structure, (iii) a description of the
solvent accessibility (e.g. the accessible surface area) of each p-residue. While not
essential for the assessment of the ECO energy value, the storage of such additional data
may be valuable when using ECO's for specialized applications like high-throughput
fold recognition or protein design. A more extensive list of additional ECO elements is
provided in the section FORMAL DESCRIPTION OF ECO'S.

For the sake of compactness, data elements in an ECO object which refer to the 
protein structure, residue types, rotameric states, and so on are preferably encoded as 
indices pointing to data present in external lists, instead of explicitly including the raw 
data. Also, ECO's sharing common properties (e.g. ECO's generated for the same 
protein structure, using the same rotamer library and energy function) may be grouped 
into a data structure (e.g. a file, an array or a database), the shared properties being
assigned to the data structure rather than included in each ECO object. However, this 
kind of data storage and encoding is purely a mode of implementation, not an essential 
aspect of the present invention. 

Many further transformations of the native ECO energy values obtained from the 
EC step may be useful within the context of specific applications of the present 
invention. Examples of such transformations include, without limitation, (i) the 
expression of ECO energies relative to the lowest value of all ECO's known for
different patterns located at the same residue positions, so that each transformed ECO
energy becomes non-negative, (ii) transforming ECO energies by means of a linear,
logarithmic or exponential function, (iii) truncating every transformed or
non-transformed ECO energy above a given threshold value to that threshold value, (iv) the
expression of ECO's in normalized units, and so on.
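
The transformations listed above can be illustrated with a minimal Python sketch. The function names, the dictionary layout (pattern label mapped to energy) and the toy energy values are assumptions made purely for illustration; they are not part of the described method.

```python
import math

def shift_to_lowest(energies):
    """(i) Express energies relative to the lowest value, so all become >= 0."""
    lowest = min(energies.values())
    return {p: e - lowest for p, e in energies.items()}

def apply_function(energies, func):
    """(ii) Apply a function (e.g. linear, logarithmic, exponential) to each energy."""
    return {p: func(e) for p, e in energies.items()}

def truncate(energies, threshold):
    """(iii) Replace every energy above `threshold` by `threshold`."""
    return {p: min(e, threshold) for p, e in energies.items()}

# Toy native ECO energies for three hypothetical patterns at one position
# (arbitrary units, invented for this example).
native = {"Ala": -1.5, "Leu": 0.5, "Trp": 42.0}

shifted = shift_to_lowest(native)             # lowest pattern now sits at 0.0
capped = truncate(shifted, 10.0)              # the large Trp value is capped at 10.0
logged = apply_function(capped, math.log1p)   # log(1 + E), defined because E >= 0
```

Applying the shift before the logarithm keeps the argument of the logarithm non-negative; the order of the operations is a design choice of this sketch.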

Another type of transformation, i.e. the grouping of native ECO's into grouped
ECO's (referred to as GECO's), does form part of the present invention. This grouping is
applicable to the SPA mode wherein a multitude of ECO's are generated for identical
positions in a protein. In a preferred embodiment of this invention, ECO's sharing
identical positions in the protein of interest may be grouped into GECO's in order to (i)
reduce the amount of information and (ii) extract the most significant information. The
importance of ECO grouping as well as some preferred methods to generate GECO's are
discussed in the section GROUPED ECO'S.

In order to formally describe ECO's and to use ECO energies
for single residues and pairs of residues in computing the energy of multiply substituted proteins,
the following notations are used (see also LIST OF SYMBOLIC NOTATIONS).
• N: the set of residue positions of interest within the protein of interest. Example 1: N
= {all amino acid residues of the protein}. Example 2: N = {all fully buried amino
acid residues of the protein}.

• i: a specific residue position of interest, where $i \in N$. Example: i = 1 denotes the
first residue position of N.

• A(i): the set of residue types of interest at position i. In the SPA mode of the present
invention, this set may contain multiple residue types. Example 1: A(1) = {all
natural amino acid residue types}. Example 2: A(1) = {Gly, Ala, Ser}.

• $i^a$: a specific residue type of interest a at position i, where $i^a \in A(i)$. Example: $i^a = 1^2$
denotes the second residue type of A(1), at the first residue position.

• $R(i^a)$: the set of rotamers of interest for residue type a at position i. Rotamers of
interest may be retrieved from either a backbone dependent or independent library.
Example 1: $R(i^a)$ = R(a) = the set of rotamers known for residue type a in the
rotamer library described by De Maeyer et al. (cited supra). Example 2: $R(i^a)$ = R(a)
\ {rotamers which are backbone-constrained at position i}. Example 3: the set of
rotamers $R(i^a)$ where a refers to a Gly residue type may contain either no element or
one dummy element since Gly has no rotatable side chain.

• $i_r^a$: a specific rotamer of interest r for residue type a at position i, where $i_r^a \in R(i^a)$. A
specific instance of $i_r^a$ forms a uniquely defined single residue pattern. Example: $i_r^a = 1_3^2$
denotes the pattern formed by the third rotamer of $R(1^2)$, for the second residue
type of A(1), at the first residue position.

• [i, j]: the pair of specific residue positions of interest i and j.

• $[i^a, j^b]$: the pair of residue positions i and j, where residue types a and b are placed at
positions i and j, respectively (in an undefined conformation).

• $[i_r^a, j_s^b]$: the pair of residue positions i and j, where residue types a and b are placed at
positions i and j, respectively, and residue types a and b adopt rotameric states r and
s, respectively. A specific instance of $[i_r^a, j_s^b]$ forms a uniquely defined pair residue
pattern.

• [i, j, k, ...]: the multiplet of residue positions of interest i, j, k, ...

• $[i^a, j^b, k^c, \ldots]$: the multiplet of residue positions i, j, k, ... where residue types a,
b, c, ... are placed at positions i, j, k, ..., respectively (in an undefined
conformation).

• $[i_r^a, j_s^b, k_t^c, \ldots]$: the multiplet of residue positions i, j, k, ... where residue types a, b, c,
... are placed at positions i, j, k, ..., respectively, and residue types a, b, c, ... adopt
rotameric states r, s, t, ..., respectively. A specific instance of $[i_r^a, j_s^b, k_t^c, \ldots]$ forms a
uniquely defined multi-residue pattern.

• P: the complete set of patterns of interest when generating ECO's in the SPA mode.

• P(i): the set of single residue patterns of interest at position i, where $P(i) \subset P$. By
analogy, P([i, j]) and, in general, P([i, j, k, ...]) is defined for pair and multiple
residue patterns.

• $P(i^a)$: the set of single residue patterns for residue type a at position i, where
$P(i^a) \subset P(i) \subset P$. By analogy, $P([i^a, j^b])$ and, in general, $P([i^a, j^b, k^c, \ldots])$ is defined for pair
and multiple residue patterns having defined residue types.

• ref: the reference structure of the protein of interest. Preferably the reference
structure is the energetically best possible global structure for the protein of interest.
In the preferred embodiment of the invention wherein rotamers are used to describe
the allowed conformational states of residue types, the reference structure can be
written as a special pattern $[i_g^r, j_g^r, k_g^r, \ldots]$. In this pattern, all N residues i, j, k, ... of
the protein (excluding those of the template) assume the reference amino acid type
as defined in the reference sequence (referred to as the "reference type" r), and a
particular rotameric state as obtained from a structure prediction method such as a
DEE method (referred to as the "global minimum energy rotamer" g). Note that r
and g are not variables and are therefore not written in italic. Less preferably, the
reference structure may be any energetically relaxed structure, e.g. energy minimized by a
steepest descent or conjugate gradient method. Least preferably, the
reference structure may be any non-optimized experimental or theoretical structure
for the protein of interest or, alternatively, the reference structure may be void and
may be assigned a reference energy $E_{\mathrm{ref}} = 0$.

• $E_{\mathrm{ref}}$: the global energy of the reference structure. In the preferred embodiment using
rotameric descriptions, a fixed template and a pair-wise summable energy function,
$E_{\mathrm{ref}}$ can be conveniently written as

$$E_{\mathrm{ref}} = E_{\mathrm{template}} + \sum_i E_t(i_g^r) + \sum_i \sum_{j>i} E_p(i_g^r, j_g^r) \qquad \text{(eq. 4)}$$

Here, $E_{\mathrm{template}}$ is the template self-energy, i.e. the sum of all atomic interactions
within the template. The superscripts r refer to the reference amino acid type and the
subscripts g denote the reference rotamer (the global minimum energy rotamer).
$E_t(i_g^r)$ is the direct interaction energy of $i_g^r$ with the template, including its
self-energy (the subscript t refers to the template). The self-energy may include a solvent
correction term such as a TTS term as discussed in the section ENERGY
FUNCTION. $E_p(i_g^r, j_g^r)$ denotes the pair-wise residue/residue interaction energy, also
called rotamer/rotamer pair-energy, between $i_g^r$ and $j_g^r$. Likewise, pair-energies may
include a solvent correction term as discussed in the section ENERGY FUNCTION.
Optionally, the term $E_{\mathrm{template}}$ can be omitted in equation (4) since (i) it is a constant
and (ii) $E_{\mathrm{ref}}$ is to be used in combination with $E(i_r^a)$, discussed below, wherein the
same term can be discarded (in such case, $E_{\mathrm{ref}}$ is not a global but a partial energy). A
great advantage of equation (4) is that the $E_t$ and $E_p$ terms can be pre-computed for
all possible instances of single and pair residue patterns. These values can then
repeatedly be used in the search for the optimal reference structure as well as in the
computation of ECO energy values. In less preferred embodiments wherein no
rotamers are used and/or the template is not fixed and/or a non-pair-wise summable
energy model is applied, the reference energy $E_{\mathrm{ref}}$ must be calculated in accordance
with the characteristics of the applied energy model, and typically as a sum over all
possible inter-atomic interactions within the reference structure. In the least
preferred embodiment, where no reference structure is considered, $E_{\mathrm{ref}}$ may be set
equal to 0 or any arbitrary value, with the consequence that ECO energies can no
longer be interpreted as "substitution energies". However, such embodiments may
still be useful in specific applications wherein the absolute values of ECO energies
are irrelevant, e.g. when different mutations are compared one with another.

• $E(i_r^a)$: the global energy for the reference structure which has been modified by (1) a
PI step wherein the single residue pattern $i_r^a$ is mutated into the reference structure
and (2) a SA step as described in the section DESCRIPTION OF THE ECO
CONCEPT. This is not a differential but an absolute energy for a complete
molecule. In the preferred embodiment using rotamers, a fixed template and a
pair-wise energy function, this energy can be calculated in accordance with the following
equation (5), wherein the terms applicable to residue i of the single residue
pattern $i_r^a$ are replaced by the appropriate energy contributions:

$$E(i_r^a) = E_{\mathrm{template}} + E_t(i_r^a) + \sum_j E_p(i_r^a, j_{g'}^r) + \sum_j E_t(j_{g'}^r) + \sum_j \sum_{k>j} E_p(j_{g'}^r, k_{g'}^r) \; ; \quad j,k \neq i \qquad \text{(eq. 5)}$$

In this equation, the residue positions j may have quit their reference rotameric
state g as a result of the SA step (hence the subscript g'), but not their reference type r:
adaptation occurs by conformational rearrangements wherein type switches are not allowed. In the
optional case that $E_{\mathrm{template}}$ has been omitted in the calculation of $E_{\mathrm{ref}}$, the same term
should be omitted in equation (5) as well and $E(i_r^a)$ is not a global but a partial
energy. In a less preferred embodiment wherein no rotamers are used and/or the
template is not fixed and/or a non-pair-wise summable energy function is applied,
the value for $E(i_r^a)$ must be calculated similarly to the reference energy $E_{\mathrm{ref}}$, typically
as a sum over all possible inter-atomic interactions within the substituted and
adapted structure.

• $E_{\mathrm{ECO}}(i_r^a)$: the ECO energy value for the single residue pattern $i_r^a$. As previously
defined, it is the energy of the reference structure which has been substituted by and
conformationally adapted to the pattern $i_r^a$, $E(i_r^a)$, minus the energy of the reference
structure:

$$E_{\mathrm{ECO}}(i_r^a) = E(i_r^a) - E_{\mathrm{ref}} \qquad \text{(eq. 6)}$$

This definition is valid for all embodiments of the present invention. By analogy, it
is also possible to define ECO energies for pair patterns. In the preferred
embodiment using rotamers, a fixed template and a pair-wise energy function, the
global energy of the reference structure substituted by and adapted to a pair residue
pattern $[i_r^a, j_s^b]$ can be written as follows:

$$\begin{aligned}
E([i_r^a, j_s^b]) ={}& E_{\mathrm{template}} + E_t(i_r^a) + E_t(j_s^b) + E_p(i_r^a, j_s^b) + \sum_k E_p(i_r^a, k_{g'}^r) + \sum_k E_p(j_s^b, k_{g'}^r) \\
&+ \sum_k E_t(k_{g'}^r) + \sum_k \sum_{l>k} E_p(k_{g'}^r, l_{g'}^r) \; ; \quad k,l \neq i,j
\end{aligned} \qquad \text{(eq. 7)}$$

The ECO energy for a pair residue pattern is then

$$E_{\mathrm{ECO}}([i_r^a, j_s^b]) = E([i_r^a, j_s^b]) - E_{\mathrm{ref}} \qquad \text{(eq. 8)}$$

In the less preferred embodiments wherein no rotamers are used and/or the template
is not fixed and/or a non-pair-wise summable energy function is applied, the value for
$E([i_r^a, j_s^b])$ must be calculated in accordance with the characteristics of the
applied energy model, and typically as a sum over all possible interatomic interactions
within the substituted and adapted structure. In the general case of multi-residue
patterns, the calculation of ECO energies occurs as for single and pair residue patterns.
Here, only the equation for the ECO energy of a multi-residue pattern $[i_r^a, j_s^b, k_t^c, \ldots]$ is
given:

$$E_{\mathrm{ECO}}([i_r^a, j_s^b, k_t^c, \ldots]) = E([i_r^a, j_s^b, k_t^c, \ldots]) - E_{\mathrm{ref}} \qquad \text{(eq. 9)}$$
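
As a hedged illustration of the definitions above, the following Python sketch evaluates a pair-wise summable energy from pre-computed template and pair terms and derives an ECO energy as the adapted energy minus the reference energy. The dictionary layouts, the state labels such as "G:0", and the exhaustive search standing in for the SA step (the text prefers a DEE method) are assumptions of this sketch, and the numbers are invented toy values.

```python
from itertools import product

def total_energy(assign, E_t, E_p, E_template=0.0):
    """Pair-wise summable energy of a {position: state} assignment."""
    positions = sorted(assign)
    energy = E_template
    for n, i in enumerate(positions):
        energy += E_t[(i, assign[i])]                      # template term
        for j in positions[n + 1:]:                        # each pair counted once
            key = frozenset([(i, assign[i]), (j, assign[j])])
            energy += E_p.get(key, 0.0)                    # pair term
    return energy

def eco_energy(pattern, reference, rotamers, E_t, E_p):
    """Adapted energy with the pattern fixed, minus the reference energy."""
    e_ref = total_energy(reference, E_t, E_p)
    free = [i for i in sorted(reference) if i not in pattern]
    best = None
    # Brute-force adaptation: try every rotamer combination of non-pattern positions.
    for combo in product(*(rotamers[i] for i in free)):
        trial = dict(pattern)
        trial.update(zip(free, combo))
        e = total_energy(trial, E_t, E_p)
        if best is None or e < best:
            best = e
    return best - e_ref

# Toy data: two positions; reference is Gly at 1 and Leu (rotamer 0) at 2.
E_t = {(1, "G:0"): -1.0, (1, "W:0"): -2.0, (2, "L:0"): -1.0, (2, "L:1"): -0.5}
E_p = {
    frozenset([(1, "G:0"), (2, "L:0")]): -0.5,
    frozenset([(1, "W:0"), (2, "L:0")]): 3.0,   # clash with the reference rotamer
    frozenset([(1, "W:0"), (2, "L:1")]): -1.0,  # relieved after adaptation
}
reference = {1: "G:0", 2: "L:0"}
rotamers = {2: ["L:0", "L:1"]}
value = eco_energy({1: "W:0"}, reference, rotamers, E_t, E_p)  # -1.0 for these numbers
```

In this toy case the Trp pattern clashes with the reference rotamer of its neighbour, and the negative ECO energy only appears once the neighbour is allowed to switch rotamer, which is the role of the adaptation step.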



In order to complete the above formal description of ECO's, a more concrete
description of the term is now provided. An energetic compatibility object is a
structured data set comprising elements or pieces of information. Here, a distinction
may be made between (i) basic elements and (ii) optional elements, depending on
whether they are essential for a correct execution of an ECO-based application or
whether they may simply be helpful in such execution. Clearly, this distinction depends
on the specific application of interest to a user.

The basic elements of an ECO are those which together form a full description of
the origin and characteristics of a substituted and adapted protein structure, i.e. in
particular:

I. a structural description of the protein of interest or its PDB-code.

II. a description of the applied method for structural pre-refinement of I, if any.

III. a description of the applied rotamer library, if any. In the preferred embodiment of
the present invention, this corresponds to the assignment of a set of rotamers $R(i^a)$
to each position i defined in (V). For positions i being pattern residues, a refers to
a pattern residue type, whereas for non-pattern positions of interest, a refers to the
residue type as defined in (VI).

IV. a description of the applied energy function.

V. a description of the set N, i.e. the set of residue positions i of interest. This set
consists of positions in the protein at which a pattern p is placed (pattern
positions), as well as positions which are considered in the SA step (non-pattern
positions). The complement of N is defined as the template which, in the preferred
embodiment of this invention, is fixed.

VI. a description of the amino acid reference sequence, i.e. the assignment of one
specific "reference amino acid type" r to each position i defined in (V).

VII. the method applied to compute the reference structure for (VI), starting from
(I) or (II), and using (III) and (IV) (e.g. a DEE method).

VIII. a description of the reference structure for the residue positions and types defined
in (V) and (VI) respectively and obtained by the method defined in (VII), e.g. the
assignment of a "global optimum" rotamer g to each position defined in (V).

IX. the pattern residue position(s), i.e. [i, j, k, ...].

X. the pattern residue type(s), i.e. $[i^a, j^b, k^c, \ldots]$.

XI. the pattern residue conformations, i.e. $[i_r^a, j_s^b, k_t^c, \ldots]$ (this may be optional for
grouped ECO's or in the embodiments not using rotamers).

XII. the method applied to perform the SA step for the reference structure defined in
(VIII), wherein the pattern defined in (IX), (X) and (XI) is introduced, preferably
being the same method as used in (VII).

XIII. the ECO energy value $E_{\mathrm{ECO}}([i_r^a, j_s^b, k_t^c, \ldots])$ as defined by equation (9) or,
alternatively, equations (6) and (8) for single and pair residue patterns,
respectively.

Optional elements of an ECO are in particular:

XIV. a specification of the residues which have changed their conformation (e.g. by
adopting a rotameric state different from g) during the SA step, as well as a
description of the actual changes such as, preferably, the new rotamers.

XV. for grouped ECO's, a reference to the grouping method, as well as a
comprehensive description of the results of the grouping process.

XVI. one or more functional properties of the protein of interest.

XVII. a local secondary structure description of the pattern residue positions [i, j, k, ...]
within the protein structure as in (I).

XVIII. a description of the solvent accessibility of the pattern residues within the
reference structure (i.e. $[i_g^r, j_g^r, k_g^r, \ldots]$) and/or within the substituted and adapted
structure (i.e. $[i_r^a, j_s^b, k_t^c, \ldots]$).

XIX. Cartesian coordinates of one or more atoms of the pattern residues within the
reference structure and/or within the substituted and adapted structure.

XX. crystallographic temperature factors associated with the (atoms of the) pattern
residues within the protein structure from which the reference structure has been
deduced.

XXI. known chemical modifications associated with the pattern residues within the
protein structure from which the reference structure has been deduced.

XXII. physico-chemical properties associated with the pattern residue types.

Clearly, this list can be enlarged with any additional property of interest to the user.
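
One possible in-memory layout of such an object is sketched below in Python. The class names, field names and example values are hypothetical and chosen only for illustration. The sketch follows the idea, mentioned earlier in the text, that properties shared by many ECO's may be stored once in an enclosing data structure rather than in each object.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass(frozen=True)
class EcoContext:
    """Elements shared by a whole set of ECO's (roughly I to VIII and XII)."""
    structure_ref: str                    # I: structure description or PDB-code
    rotamer_library: str                  # III
    energy_function: str                  # IV
    positions: tuple                      # V: the set N
    reference_sequence: tuple             # VI: one reference type per position
    reference_rotamers: tuple             # VIII: one rotamer index per position
    prerefinement: Optional[str] = None   # II, if any
    method: str = "DEE"                   # VII and XII

@dataclass
class Eco:
    """Pattern-specific elements (IX to XIII) plus free-form optional data."""
    context: EcoContext
    pattern_positions: tuple              # IX
    pattern_types: tuple                  # X
    pattern_rotamers: Optional[tuple]     # XI (may be absent for grouped ECO's)
    energy: float                         # XIII
    extras: dict = field(default_factory=dict)  # XIV to XXII

ctx = EcoContext(
    structure_ref="XXXX",                 # placeholder, not a real PDB-code
    rotamer_library="hypothetical-library",
    energy_function="hypothetical-energy-function",
    positions=(1, 2, 3),
    reference_sequence=("Gly", "Leu", "Ser"),
    reference_rotamers=(0, 4, 1),
)
eco = Eco(ctx, pattern_positions=(1,), pattern_types=("Trp",),
          pattern_rotamers=(2,), energy=-1.0)
```

Keeping the shared context immutable (frozen) is a design choice of this sketch: many ECO objects can then safely point to the same context.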

SYSTEMATICAL PATTERN ANALYSIS (SPA) MODE

ECO's may be generated by the present method at the time they are needed, for
example, to search locations in a protein where tryptophan substitutions are allowed.
Dozens of questions of this type may be answered by one or more real-time,
case-specific pattern analysis experiments. However, some structure-related questions can
only be resolved by the present method if vast amounts of ECO data are available.
Moreover, the same ECO information may be used recurrently within the context of
different protein design experiments. Also, specific design and recognition tasks may be
performed in the absence of any explicit 3D-representation of a protein. Finally,
pre-computed sets (or databases) of ECO's may be data mined to gain structure- and
sequence-related insights, again creating added value. Therefore, there is a need to
generate ECO's in a systematic way, called the "systematical pattern analysis" (SPA)
mode of operation.

The notations, conventions and abbreviations used below conform to those listed in 
the section FORMAL DESCRIPTION OF ECO'S. 

In the SPA mode of the present invention, wherein only the rotameric embodiment
is relevant, ECO's are systematically generated for a pre-defined set P of patterns being
substituted into the reference structure of a protein of interest. Also, a distinction needs
to be made between single and double residue ECO's. Multi-residue ECO's are not
considered in the SPA mode only because their systematical computation may be too
slow.

For single ECO's, the set P is constructed as follows. First, a set I of residue
positions i is defined, where I is a subset of N (the positions of interest, actually all
modeled residues). At each position $i \in I$, a set A(i) of residue types of interest is
defined. A residue type a placed at position i is noted as $i^a$, where $i^a \in A(i)$ and where $i^a$
can be interpreted as a single residue pattern in an as yet undefined conformation. It is
noteworthy that at different positions i, different sets of types A(i) may be considered.
Next, for each defined position/type combination $i^a$, a set $R(i^a)$ comprising rotamers of
interest for type a at position i is established, preferably retrieved from a rotamer
library. A rotamer r, generated for type a, placed at position i is noted as $i_r^a$, where
$i_r^a \in R(i^a)$ and where $i_r^a$ can be interpreted as an unambiguous single residue pattern $p \in P$.

For double ECO's, basically the same approach is followed, but all combinations of
types and rotamers need to be considered for all possible pairs of positions. More
specifically, define by $i, j \in I$ a pair of residue positions, by $i^a \in A(i)$ and $j^b \in A(j)$ a
pair of residue types at positions i and j, respectively, and by $i_r^a \in R(i^a)$ and $j_s^b \in R(j^b)$ a
pair of rotamers for types a and b at positions i and j, respectively, then the set P
consists of all possible unique and unambiguous pair-residue patterns $[i_r^a, j_s^b]$, where i < j.
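
The construction of the pattern set P can be sketched as a plain enumeration. In the Python sketch below, the container layouts (A maps a position to its residue types, R maps a position/type combination to rotamer indices) and the toy contents are assumptions made for illustration only.

```python
from itertools import combinations, product

def single_patterns(I, A, R):
    """All single residue patterns as (position, type, rotamer) triples."""
    return [(i, a, r) for i in I for a in A[i] for r in R[(i, a)]]

def double_patterns(I, A, R):
    """All unique pair patterns ((i, a, r), (j, b, s)) with i < j."""
    pairs = []
    for i, j in combinations(sorted(I), 2):       # each position pair once, i < j
        left = single_patterns([i], A, R)
        right = single_patterns([j], A, R)
        pairs.extend(product(left, right))        # every type/rotamer combination
    return pairs

# Toy input: two positions; Gly carries a single dummy rotamer.
I = [3, 7]
A = {3: ["Gly", "Ala"], 7: ["Ser"]}
R = {(3, "Gly"): [0], (3, "Ala"): [0], (7, "Ser"): [0, 1, 2]}

P_single = single_patterns(I, A, R)   # 2 patterns at position 3, 3 at position 7
P_double = double_patterns(I, A, R)   # 2 x 3 pair patterns for the pair (3, 7)
```

The number of pair patterns grows with the product of the rotamer counts at both positions and with the number of position pairs, so the double set is far larger than the single set.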

In the SPA mode, the reference structure for the protein of interest needs to be
generated only once (see REFERENCE STRUCTURE) and the associated reference
energy $E_{\mathrm{ref}}$ can be used in the calculation of the ECO energy value for all patterns of the
set P. Then all patterns of P are substituted into the reference structure, the structure is
adapted to the pattern and the total energy is calculated using equation (5) or (7) for
single or double-residue patterns, respectively. This yields necessary and sufficient data
to calculate the ECO energy associated with each single and/or double residue pattern,
using eqs. (6) and/or (8), respectively.

Finally, the generated data are stored on a suitable storage device (preferably into
an array, a file or a database) where typically one object per pattern is created; each
object contains (some of) the data and references described in the section FORMAL
DESCRIPTION OF ECO'S. Post-analysis operations including transformations (see
DESCRIPTION OF THE ECO CONCEPT) or grouping operations (see GROUPED
ECO'S) may then be applied.

GROUPED ECO'S (GECO'S)

Besides energetic transformations (see DESCRIPTION OF THE ECO
CONCEPT), the most important post-analysis method, applicable to ECO's generated
in the SPA-mode of the present invention, is the grouping of ECO's into GECO's.
While there is a nearly infinite variety of possible grouping operations, we shall
specifically address GECO's which are preferred in view of their practical usefulness
and ease of interpretation, i.e. (i) single GECO's, (ii) double GECO's and (iii) GECO's
resulting from a grouping operation performed on the basis of residue type properties.

A single GECO, denoted $G(i^a)$, is a GECO derived from an ensemble of single
residue ECO's, generated in SPA-mode, for a specific amino acid residue type a
placed at position i, and a set of rotamers $R(i^a)$, i.e. the patterns of this ensemble only
differ in rotameric state. Each ECO of this ensemble can be seen as the result of a trial
experiment wherein the compatibility of its associated pattern, i.e. a particular residue
type a at position i in rotameric state r, is investigated within the context of an adaptive
reference structure. A clear measure for the said compatibility is the energy of each
ECO of this ensemble, $E_{\mathrm{ECO}}(i_r^a)$. In a preferred embodiment, the set of $E_{\mathrm{ECO}}(i_r^a)$ values
for a given residue position and type $i^a$ and all rotamers $r \in R(i^a)$ can simply be grouped
by searching over all rotameric states for the minimum value for $E_{\mathrm{ECO}}(i_r^a)$ according to
equation (10):

$$E_{\mathrm{GECO}}(i^a) = \min_r E_{\mathrm{ECO}}(i_r^a) \qquad \text{(eq. 10)}$$

wherein $E_{\mathrm{GECO}}(i^a)$ is the energy value resulting from the minimum searching operation
on all $E_{\mathrm{ECO}}(i_r^a)$ values. If desired, the min-rotamer $r_{\min}$ can replace the pattern rotamer
element in a GECO object (element (XI) in the FORMAL DESCRIPTION OF ECO'S)
but this is only optional since a GECO is primarily intended to assess the compatibility
of an amino acid type at a specific position, rather than that of a conformation. A
numerical illustration of such grouping operation is shown in the following table 2.

Another preferred method to group ECO's is by taking the average over a set of
low-energy values, wherein the latter set includes all values for $E_{\mathrm{ECO}}(i_r^a)$ below a
suitably chosen threshold value T, according to equation (11):

$$E_{\mathrm{GECO}}(i^a) = \mathrm{aver}_r \left( E_{\mathrm{ECO}}(i_r^a) \mid E_{\mathrm{ECO}}(i_r^a) < T \right) \qquad \text{(eq. 11)}$$

wherein $\mathrm{aver}_r$ is the averaging operator working on $E_{\mathrm{ECO}}(i_r^a)$ values below T for all
rotamers r at $i^a$. When applying equation (11), some residue types a may turn out not to be
compatible with position i, namely when no $E_{\mathrm{ECO}}(i_r^a)$ value exists below T. When applying this
grouping method, it is impossible to assign a specific rotamer r to the resulting GECO,
but this is irrelevant if the grouping process is only for the purpose of assessing the
compatibility of amino acid types.

Another preferred grouping method is based on using a probabilistic averaging
function such as shown in equation (12):

$$E_{\mathrm{GECO}}(i^a) = \sum_r E_{\mathrm{ECO}}(i_r^a) \, p(i_r^a) \qquad \text{(eq. 12)}$$

wherein

$$p(i_r^a) = \exp(-\beta E_{\mathrm{ECO}}(i_r^a)) / Z(i_r^a) \qquad \text{(eq. 13)}$$

and wherein $\beta$ is a suitable constant (e.g. $\beta = 1/(k_B T)$ wherein $k_B$ is Boltzmann's constant
and T is the absolute temperature in kelvin) and $Z(i_r^a)$ is the partition function defined by
equation (14) as:

$$Z(i_r^a) = \sum_r \exp(-\beta E_{\mathrm{ECO}}(i_r^a)) \qquad \text{(eq. 14)}$$

This grouping method is highly preferred since it is statistically supported and
requires no absolute energetic threshold value.
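
The three grouping rules for single GECO's (minimum, thresholded average and probabilistic average) can be compared side by side in a short Python sketch. The function names, the mapping from rotamer index to ECO energy and the toy values are illustrative assumptions. The energy shift inside the probabilistic average is a numerical-stability device added in this sketch; it leaves the probabilities of equation (13) unchanged.

```python
import math

def geco_min(eco):
    """Eq. (10): minimum ECO energy over rotamers, plus the min-rotamer."""
    r_min = min(eco, key=eco.get)
    return eco[r_min], r_min

def geco_threshold_average(eco, threshold):
    """Eq. (11): mean of the ECO energies below `threshold`; None if there are none."""
    low = [e for e in eco.values() if e < threshold]
    return sum(low) / len(low) if low else None

def geco_boltzmann(eco, beta):
    """Eqs. (12) to (14): average weighted by p = exp(-beta * E) / Z."""
    e0 = min(eco.values())                                  # shift avoids overflow
    weights = {r: math.exp(-beta * (e - e0)) for r, e in eco.items()}
    z = sum(weights.values())                               # shifted partition function
    return sum(eco[r] * w for r, w in weights.items()) / z

# Toy ECO energies for three rotamers of one residue type at one position.
eco = {0: -2.0, 1: 0.0, 2: 5.0}

e_min, r_min = geco_min(eco)                   # (-2.0, 0)
e_avg = geco_threshold_average(eco, 1.0)       # mean of -2.0 and 0.0
e_prob = geco_boltzmann(eco, beta=1.0)         # dominated by the lowest rotamer
```

With beta equal to zero the probabilistic average reduces to the plain mean, and for large beta it approaches the minimum of equation (10), so the constant interpolates between the two simpler rules.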

A double GECO, denoted $G([i^a, j^b])$, is a grouped ECO derived from an
ensemble of double residue ECO's, generated in SPA-mode, for pair residue patterns
implying $[i^a, j^b]$. $G([i^a, j^b])$ provides information regarding the compatibility of
the pair of residue types a and b, at positions i and j, respectively, without necessarily
specifying their rotameric states. The possibilities to group double ECO's for pair
residue patterns $[i_r^a, j_s^b]$ are analogous to those for grouping single ECO's. In a first
embodiment, the minimum energy value for all pair-wise combinations of rotamers r
and s can be searched according to equation (15):

$$E_{\mathrm{GECO}}([i^a, j^b]) = \min_{r,s} E_{\mathrm{ECO}}([i_r^a, j_s^b]) \qquad \text{(eq. 15)}$$

wherein $\min_{r,s}$ is the minimum searching operator.

A second embodiment involves averaging combinations having a pair energy below a
given threshold, following equation (16):

$$E_{\mathrm{GECO}}([i^a, j^b]) = \mathrm{aver}_{r,s} \left( E_{\mathrm{ECO}}([i_r^a, j_s^b]) \mid E_{\mathrm{ECO}}([i_r^a, j_s^b]) < T \right) \qquad \text{(eq. 16)}$$

wherein $\mathrm{aver}_{r,s}$ is the averaging operator.

A third embodiment is a statistical averaging according to equation (17):

$$E_{\mathrm{GECO}}([i^a, j^b]) = \sum_{r,s} E_{\mathrm{ECO}}([i_r^a, j_s^b]) \, p([i_r^a, j_s^b]) \qquad \text{(eq. 17)}$$

wherein

$$p([i_r^a, j_s^b]) = \exp(-\beta E_{\mathrm{ECO}}([i_r^a, j_s^b])) / Z([i_r^a, j_s^b]) \qquad \text{(eq. 18)}$$

and the partition function $Z([i_r^a, j_s^b])$ is

$$Z([i_r^a, j_s^b]) = \sum_{r,s} \exp(-\beta E_{\mathrm{ECO}}([i_r^a, j_s^b])) \qquad \text{(eq. 19)}$$

Another type of grouping is at the level of amino acid features or
properties rather than rotamers, which is useful in assessing the compatibility of a
certain type of amino acid, e.g. an aromatic amino acid, at residue position i. Indeed,
GECO's can be constructed by grouping amino acids sharing some feature or property
of interest. For example, $G(i^{\mathrm{Pol}})$ where Pol = {S,T,N,D,E,Q,R,K} denotes a GECO for a
polar amino acid at position i. The energy value associated with such a GECO can be
suitably defined by equation (20) as:

$$E_{\mathrm{GECO}}(i^{\mathrm{Kind}}) = \mathrm{aver}_{a \in \mathrm{Kind}} \, E_{\mathrm{GECO}}(i^a) \qquad \text{(eq. 20)}$$

Such a grouping is preferably performed on GECO's already grouped by type (over
rotameric states). Other potentially useful grouping operations include, but are not
limited to, sets of:

- aromatic amino acid types where Kind = {H,F,Y,W},

- small residue types where Kind = {G,A} or {G,A,S},

- β-branched types where Kind = {I,V,T},

and the like.

Generally speaking, this latter type of grouping is intended for and suitable to 
condense primary information into structured information of a higher level which may 
be useful, for example, in fold recognition by fast screening of amino acid sequences 
against structure-based profiles. 
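
Grouping by residue type property amounts to averaging type-level GECO energies over a named set of one-letter codes. The Python sketch below uses the sets given in the text; the dictionary names, the toy energies and the choice to skip types for which no GECO energy is available are assumptions of this illustration.

```python
KINDS = {
    "polar": set("STNDEQRK"),
    "aromatic": set("HFYW"),
    "small": set("GAS"),
    "beta_branched": set("IVT"),
}

def geco_by_kind(type_geco, kind):
    """Average the type-level GECO energies over the residue types in `kind`."""
    values = [e for a, e in type_geco.items() if a in kind]
    return sum(values) / len(values) if values else None

# Toy type-level GECO energies at one position (already grouped over rotamers).
type_geco = {"F": -1.0, "Y": -3.0, "W": 1.0, "G": 4.0}

e_aromatic = geco_by_kind(type_geco, KINDS["aromatic"])   # mean over F, Y and W
e_polar = geco_by_kind(type_geco, KINDS["polar"])         # None: no polar type present
```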


ASSESSING GLOBAL ENERGIES FROM SINGLE AND DOUBLE (G)ECO'S 

The present invention also relates to a method to assess the global fitness of a
protein structure in which an arbitrary number of substitutions have taken place
compared to a reference amino acid sequence. More specifically, it is possible to assess
the global energy, and thereby the fitness, of a multiply substituted protein structure by
making use of single and preferably also double ECO or GECO energy values. The
computation of this global energy may occur by simply summing energetic terms for
single residues and pairs of residues as demonstrated hereinafter. As a consequence, the
compatibility of a given amino acid sequence with a given protein 3D structure, after
proper alignment, can be assessed quickly, compared to current methods using
conventional modelling techniques.

First, the assessment method of the invention establishes a mathematical
relationship between single and double ECO energies. Then this relationship is
further analyzed, thereby demonstrating how double ECO energies can be derived from
single ECO energies. Next, this method is extended so as to compute the energy of
multi-residue patterns from single ECO energies. Finally, it is discussed how single and
double GECO's can be used to assess the global energy of a multiply substituted protein
structure.

In the preferred embodiment of the present invention, i.e. when using rotamers,
a fixed template and a pair-wise energy function, the global energy of a reference amino
acid sequence in its reference structure is given by equation (4). When isolating, in this
equation, the energetic contributions for two residue positions of interest i and j, we
obtain equation (21):

$$\begin{aligned}
E_{\mathrm{ref}} ={}& E_{\mathrm{template}} + \sum_k E_t(k_g^r) + \sum_k \sum_{l>k} E_p(k_g^r, l_g^r) \\
&+ E_t(i_g^r) + E_t(j_g^r) + E_p(i_g^r, j_g^r) \\
&+ \sum_k E_p(i_g^r, k_g^r) + \sum_k E_p(j_g^r, k_g^r) \; ; \quad i \neq j ; \; k,l \neq i,j
\end{aligned} \qquad \text{(eq. 21)}$$

wherein all terms and symbols have the same meaning as in equation (4).

A similar equation for $E(i_r^a)$ can be derived from equation (5), wherein the
contributions for the residue positions i and j are isolated, recalling that $E(i_r^a)$ is the
energy for the reference structure in which the reference rotamer $i_g^r$ has been replaced by
the pattern rotamer $i_r^a$ and where the other residues have had the opportunity to assume a
new lowest energy rotamer g' which may or may not be different from the reference
rotamer g. Thus, we can write $E(i_r^a)$ as

$$\begin{aligned}
E(i_r^a) ={}& E_{\mathrm{template}} + \sum_k E_t(k_{g'}^r) + \sum_k \sum_{l>k} E_p(k_{g'}^r, l_{g'}^r) \\
&+ E_t(i_r^a) + E_t(j_{g'}^r) + E_p(i_r^a, j_{g'}^r) \\
&+ \sum_k E_p(i_r^a, k_{g'}^r) + \sum_k E_p(j_{g'}^r, k_{g'}^r) \; ; \quad i \neq j ; \; k,l \neq i,j
\end{aligned} \qquad \text{(eq. 22)}$$

Likewise, we obtain for $E(j_s^b)$, i.e. the global energy of the reference structure
substituted by and adapted to the pattern $j_s^b$, wherein possible rotameric adaptations are
denoted by the subscript g'':

$$\begin{aligned}
E(j_s^b) ={}& E_{\mathrm{template}} + \sum_k E_t(k_{g''}^r) + \sum_k \sum_{l>k} E_p(k_{g''}^r, l_{g''}^r) \\
&+ E_t(i_{g''}^r) + E_t(j_s^b) + E_p(i_{g''}^r, j_s^b) \\
&+ \sum_k E_p(i_{g''}^r, k_{g''}^r) + \sum_k E_p(j_s^b, k_{g''}^r) \; ; \quad k,l \neq i,j
\end{aligned} \qquad \text{(eq. 23)}$$

The energy of the pair pattern $[i_r^a, j_s^b]$, causing rotameric adaptations
denoted by the subscript g''', is given by equation (24):

$$\begin{aligned}
E([i_r^a, j_s^b]) ={}& E_{\mathrm{template}} + \sum_k E_t(k_{g'''}^r) + \sum_k \sum_{l>k} E_p(k_{g'''}^r, l_{g'''}^r) \\
&+ E_t(i_r^a) + E_t(j_s^b) + E_p(i_r^a, j_s^b) \\
&+ \sum_k E_p(i_r^a, k_{g'''}^r) + \sum_k E_p(j_s^b, k_{g'''}^r) \; ; \quad k,l \neq i,j
\end{aligned} \qquad \text{(eq. 24)}$$

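If the expression (21) for $E_{\mathrm{ref}}$ is subtracted from equations (22) to (24), we obtain
the equations for the single ECO energies $E_{\mathrm{ECO}}(i_r^a)$ and $E_{\mathrm{ECO}}(j_s^b)$ and for the double ECO
energy $E_{\mathrm{ECO}}([i_r^a, j_s^b])$ as follows:

$$\begin{aligned}
E_{\mathrm{ECO}}(i_r^a) ={}& \sum_k \left(E_t(k_{g'}^r) - E_t(k_g^r)\right) + \sum_k \sum_{l>k} \left(E_p(k_{g'}^r, l_{g'}^r) - E_p(k_g^r, l_g^r)\right) \\
&+ \left(E_t(i_r^a) - E_t(i_g^r)\right) + \left(E_t(j_{g'}^r) - E_t(j_g^r)\right) + \left(E_p(i_r^a, j_{g'}^r) - E_p(i_g^r, j_g^r)\right) \\
&+ \sum_k \left(E_p(i_r^a, k_{g'}^r) - E_p(i_g^r, k_g^r)\right) + \sum_k \left(E_p(j_{g'}^r, k_{g'}^r) - E_p(j_g^r, k_g^r)\right) \; ; \quad i \neq j ; \; k,l \neq i,j
\end{aligned} \qquad \text{(eq. 25)}$$

$$\begin{aligned}
E_{\mathrm{ECO}}(j_s^b) ={}& \sum_k \left(E_t(k_{g''}^r) - E_t(k_g^r)\right) + \sum_k \sum_{l>k} \left(E_p(k_{g''}^r, l_{g''}^r) - E_p(k_g^r, l_g^r)\right) \\
&+ \left(E_t(i_{g''}^r) - E_t(i_g^r)\right) + \left(E_t(j_s^b) - E_t(j_g^r)\right) + \left(E_p(i_{g''}^r, j_s^b) - E_p(i_g^r, j_g^r)\right) \\
&+ \sum_k \left(E_p(i_{g''}^r, k_{g''}^r) - E_p(i_g^r, k_g^r)\right) + \sum_k \left(E_p(j_s^b, k_{g''}^r) - E_p(j_g^r, k_g^r)\right) \; ; \quad i \neq j ; \; k,l \neq i,j
\end{aligned} \qquad \text{(eq. 26)}$$

$$\begin{aligned}
E_{\mathrm{ECO}}([i_r^a, j_s^b]) ={}& \sum_k \left(E_t(k_{g'''}^r) - E_t(k_g^r)\right) + \sum_k \sum_{l>k} \left(E_p(k_{g'''}^r, l_{g'''}^r) - E_p(k_g^r, l_g^r)\right) \\
&+ \left(E_t(i_r^a) - E_t(i_g^r)\right) + \left(E_t(j_s^b) - E_t(j_g^r)\right) + \left(E_p(i_r^a, j_s^b) - E_p(i_g^r, j_g^r)\right) \\
&+ \sum_k \left(E_p(i_r^a, k_{g'''}^r) - E_p(i_g^r, k_g^r)\right) + \sum_k \left(E_p(j_s^b, k_{g'''}^r) - E_p(j_g^r, k_g^r)\right) \; ; \quad i \neq j ; \; k,l \neq i,j
\end{aligned} \qquad \text{(eq. 27)}$$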

Each of equations (25) to (27) consists of a large number of terms which have been
ordered specifically to make them more interpretable, as follows: the first line in each
equation contains terms related to non-pattern residue positions $k,l \neq i,j$ only; the second
line, in contrast, contains only terms related to pattern residues i,j; and finally, the third
line contains interactions between pattern positions i,j and non-pattern positions $k \neq i,j$.
Moreover, all terms are grouped in pairs wherein the reference contributions "g" are
subtracted from the pattern related terms "g', g'', g'''", i.e. each pair of terms therefore
reflects the difference brought about by a pattern substitution. Bearing this in mind, we
can substitute






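\[
\Delta E^{k}_{\mathrm{self}}(p) =
\begin{cases}
\sum_{k}\bigl(E_t(k_{g'}^r)-E_t(k_g^r)\bigr) + \sum_{k}\sum_{l>k}\bigl(E_p(k_{g'}^r,l_{g'}^r)-E_p(k_g^r,l_g^r)\bigr)\,; & p = i_r^a \\
\sum_{k}\bigl(E_t(k_{g''}^r)-E_t(k_g^r)\bigr) + \sum_{k}\sum_{l>k}\bigl(E_p(k_{g''}^r,l_{g''}^r)-E_p(k_g^r,l_g^r)\bigr)\,; & p = j_s^b \\
\sum_{k}\bigl(E_t(k_{g'''}^r)-E_t(k_g^r)\bigr) + \sum_{k}\sum_{l>k}\bigl(E_p(k_{g'''}^r,l_{g'''}^r)-E_p(k_g^r,l_g^r)\bigr)\,; & p = [i_r^a, j_s^b]
\end{cases}
\tag{eq. 28}
\]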



where $\Delta E^{k}_{\mathrm{self}}(p)$ is preferably interpreted as the difference in the global internal (or
"self") energy of the non-pattern residues $k \neq i,j$ (compared to the reference structure)
due to conformational rearrangements induced by pattern p, where p is either $i_r^a$, $j_s^b$ or
$[i_r^a, j_s^b]$. Similarly, we can substitute




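\[
\Delta E^{ij,k}_{\mathrm{int}}(p) =
\begin{cases}
\sum_{k}\bigl(E_p(i_r^a,k_{g'}^r)-E_p(i_g^r,k_g^r)\bigr) + \sum_{k}\bigl(E_p(j_{g'}^r,k_{g'}^r)-E_p(j_g^r,k_g^r)\bigr)\,; & p = i_r^a \\
\sum_{k}\bigl(E_p(i_{g''}^r,k_{g''}^r)-E_p(i_g^r,k_g^r)\bigr) + \sum_{k}\bigl(E_p(j_s^b,k_{g''}^r)-E_p(j_g^r,k_g^r)\bigr)\,; & p = j_s^b \\
\sum_{k}\bigl(E_p(i_r^a,k_{g'''}^r)-E_p(i_g^r,k_g^r)\bigr) + \sum_{k}\bigl(E_p(j_s^b,k_{g'''}^r)-E_p(j_g^r,k_g^r)\bigr)\,; & p = [i_r^a, j_s^b]
\end{cases}
\tag{eq. 29}
\]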

where $\Delta E^{ij,k}_{\mathrm{int}}(p)$ is preferably interpreted as the difference in total interaction energy
between the pattern residues i,j and the non-pattern residues $k \neq i,j$ (compared to the
reference structure) due to conformational rearrangements induced by pattern p, where p
is either $i_r^a$, $j_s^b$ or $[i_r^a, j_s^b]$. When entering the former two quantities into equations (25) to
(27), we obtain

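\[
\begin{aligned}
E_{\mathrm{ECO}}(i_r^a) ={}& \Delta E^{k}_{\mathrm{self}}(i_r^a) \\
&+ \bigl(E_t(i_r^a)-E_t(i_g^r)\bigr) + \bigl(E_t(j_{g'}^r)-E_t(j_g^r)\bigr) + \bigl(E_p(i_r^a,j_{g'}^r)-E_p(i_g^r,j_g^r)\bigr) \\
&+ \Delta E^{ij,k}_{\mathrm{int}}(i_r^a)
\end{aligned}
\tag{eq. 30}
\]

\[
\begin{aligned}
E_{\mathrm{ECO}}(j_s^b) ={}& \Delta E^{k}_{\mathrm{self}}(j_s^b) \\
&+ \bigl(E_t(i_{g''}^r)-E_t(i_g^r)\bigr) + \bigl(E_t(j_s^b)-E_t(j_g^r)\bigr) + \bigl(E_p(i_{g''}^r,j_s^b)-E_p(i_g^r,j_g^r)\bigr) \\
&+ \Delta E^{ij,k}_{\mathrm{int}}(j_s^b)
\end{aligned}
\tag{eq. 31}
\]

\[
\begin{aligned}
E_{\mathrm{ECO}}([i_r^a,j_s^b]) ={}& \Delta E^{k}_{\mathrm{self}}([i_r^a,j_s^b]) \\
&+ \bigl(E_t(i_r^a)-E_t(i_g^r)\bigr) + \bigl(E_t(j_s^b)-E_t(j_g^r)\bigr) + \bigl(E_p(i_r^a,j_s^b)-E_p(i_g^r,j_g^r)\bigr) \\
&+ \Delta E^{ij,k}_{\mathrm{int}}([i_r^a,j_s^b])
\end{aligned}
\tag{eq. 32}
\]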



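This set of equations now allows a double ECO energy to be written as a function of
single ECO energies. For this purpose, equations (30) and (31) are subtracted from
equation (32).

\[
\begin{aligned}
E_{\mathrm{ECO}}([i_r^a,j_s^b]) &- E_{\mathrm{ECO}}(i_r^a) - E_{\mathrm{ECO}}(j_s^b) = \\
& \Delta E^{k}_{\mathrm{self}}([i_r^a,j_s^b]) - \bigl(\Delta E^{k}_{\mathrm{self}}(i_r^a) + \Delta E^{k}_{\mathrm{self}}(j_s^b)\bigr) \\
&+ \Delta E^{ij,k}_{\mathrm{int}}([i_r^a,j_s^b]) - \bigl(\Delta E^{ij,k}_{\mathrm{int}}(i_r^a) + \Delta E^{ij,k}_{\mathrm{int}}(j_s^b)\bigr) \\
&- \bigl(E_t(j_{g'}^r)-E_t(j_g^r)\bigr) - \bigl(E_t(i_{g''}^r)-E_t(i_g^r)\bigr) \\
&+ \bigl(E_p(i_r^a,j_s^b)-E_p(i_g^r,j_g^r)\bigr) - \bigl(E_p(i_r^a,j_{g'}^r)-E_p(i_g^r,j_g^r)\bigr) - \bigl(E_p(i_{g''}^r,j_s^b)-E_p(i_g^r,j_g^r)\bigr)
\end{aligned}
\tag{eq. 33}
\]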

The terms appearing in the second and third line of equation (33) may be
analysed as follows. First, they measure the energetic consequences for the non-pattern
residues $k \neq i,j$ due to the introduction of the patterns $i_r^a$, $j_s^b$ and $[i_r^a, j_s^b]$. Secondly, it can be
safely assumed that introducing any single or pair residue pattern is unlikely to change
the conformation of an entire protein structure, whereas the introduction of e.g. a single
residue pattern $i_r^a$ is likely to cause adaptation events for only a limited number of
residues k in its immediate environment. In other words, it is expected that a protein
structure shows some "plasticity" behaviour with respect to perturbations of its structure
(i.e. the reference structure), i.e. small patterns are likely to induce small, local changes.
Taking this expectation as true, it can also be expected that two different single residue
patterns $i_r^a$ and $j_s^b$ may have non-overlapping associated sets of adapted residues.
Formally defining by K(p) the set of non-pattern residues $k \neq i,j$ which adopt a different
rotameric state (compared to the reference structure) after the introduction of, and
adaptation to, the pattern p, for the three possible sets of patterns defined on i and/or j
we have:

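\[ K(i_r^a) = \{\, k \mid k_{g'}^r \neq k_g^r \,\} \tag{eq. 34} \]
\[ K(j_s^b) = \{\, k \mid k_{g''}^r \neq k_g^r \,\} \tag{eq. 35} \]
\[ K([i_r^a, j_s^b]) = \{\, k \mid k_{g'''}^r \neq k_g^r \,\} \tag{eq. 36} \]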

The condition for non-overlapping sets of adapted residues associated with
patterns $i_r^a$ and $j_s^b$ is then represented by equation (37):

\[ K(i_r^a) \cap K(j_s^b) = \emptyset \tag{eq. 37} \]

A question addressed by the method of the present invention is
then to define the relationship between the sets $K(i_r^a)$, $K(j_s^b)$ and $K([i_r^a, j_s^b])$ and, more
specifically, the conformation of the elements in each set. Ignoring all other theoretical
possibilities, we introduce herein a hypothesis called "additivity of adaptation" (AA)
which is illustrated in figure 3 and assumes that conformational adaptation effects
caused by a pair pattern $[i_r^a, j_s^b]$ are identical to the sum of all adaptation effects caused
by each of its constituting single residue patterns $i_r^a$ and $j_s^b$. A first criterion for this
hypothesis to be fulfilled is that the introduction of a pair pattern $[i_r^a, j_s^b]$ affects exactly
the same residue positions $k \neq i,j$ as those affected by either the single residue pattern $i_r^a$
or $j_s^b$, which may be formally represented by equation (38):

\[ K(i_r^a) \cup K(j_s^b) = K([i_r^a, j_s^b]) \tag{eq. 38} \]
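A further requirement for AA is that all rotamers $k_{g'''}^r \neq k_g^r$ induced by the pattern
$[i_r^a, j_s^b]$ are found among the rotamers induced by either the single residue pattern $i_r^a$ or $j_s^b$.
This criterion can be expressed formally by the following equation:

\[
k_{g'''}^r =
\begin{cases}
k_{g'}^r\,; & \forall k,\ k \in K(i_r^a) \\
k_{g''}^r\,; & \forall k,\ k \in K(j_s^b)
\end{cases}
\tag{eq. 39}
\]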
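
The three AA criteria (equations (37) to (39)) can be checked mechanically once the adapted rotamers are known. The following is a minimal sketch, assuming each conformation is stored as a dictionary mapping a non-pattern position k to a rotamer identifier; the function names and this data layout are illustrative assumptions, not part of the described method.

```python
def adapted_set(reference, adapted):
    """K(p): non-pattern positions whose rotamer differs from the reference."""
    return {k for k, rotamer in adapted.items() if rotamer != reference[k]}


def satisfies_aa(reference, after_i, after_j, after_pair):
    """Check the additivity-of-adaptation criteria for one pair pattern.

    reference  : rotamers k_g in the reference structure
    after_i    : rotamers k_g' after introducing the single pattern at i
    after_j    : rotamers k_g'' after introducing the single pattern at j
    after_pair : rotamers k_g''' after introducing the pair pattern
    """
    k_i = adapted_set(reference, after_i)
    k_j = adapted_set(reference, after_j)
    k_pair = adapted_set(reference, after_pair)

    non_overlapping = k_i.isdisjoint(k_j)        # eq. 37
    same_positions = (k_i | k_j) == k_pair       # eq. 38
    same_rotamers = (                            # eq. 39
        all(after_pair[k] == after_i[k] for k in k_i)
        and all(after_pair[k] == after_j[k] for k in k_j)
    )
    return non_overlapping and same_positions and same_rotamers
```

Returning a single boolean keeps the sketch short; a diagnostic variant could instead return the three criteria separately to see which one fails.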



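This criterion is unambiguous if the conditions set forth in equations (37) and
(38) are fulfilled. Then, under the AA hypothesis, the equations (30) to (32) can be
drastically simplified, e.g. by combining equations (28) and (39):

\[ \Delta E^{k}_{\mathrm{self}}([i_r^a,j_s^b]) - \Delta E^{k}_{\mathrm{self}}(i_r^a) - \Delta E^{k}_{\mathrm{self}}(j_s^b) = 0 \tag{eq. 40} \]

and, similarly, by combining equations (29) and (39):

\[ \Delta E^{ij,k}_{\mathrm{int}}([i_r^a,j_s^b]) - \Delta E^{ij,k}_{\mathrm{int}}(i_r^a) - \Delta E^{ij,k}_{\mathrm{int}}(j_s^b) = 0 \tag{eq. 41} \]

so that equation (33) simplifies to:

\[
\begin{aligned}
E_{\mathrm{ECO}}([i_r^a,j_s^b]) ={}& E_{\mathrm{ECO}}(i_r^a) + E_{\mathrm{ECO}}(j_s^b) \\
&- \bigl(E_t(j_{g'}^r)-E_t(j_g^r)\bigr) - \bigl(E_t(i_{g''}^r)-E_t(i_g^r)\bigr) \\
&+ \bigl(E_p(i_r^a,j_s^b)-E_p(i_g^r,j_g^r)\bigr) - \bigl(E_p(i_r^a,j_{g'}^r)-E_p(i_g^r,j_g^r)\bigr) - \bigl(E_p(i_{g''}^r,j_s^b)-E_p(i_g^r,j_g^r)\bigr)
\end{aligned}
\tag{eq. 42}
\]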

When the AA criteria are met, equation (42) provides a means to compute the
energy of a double ECO from the energies of single ECO's. This equation only requires
terms directly involving pattern positions i and j, not terms related to non-pattern
positions. The said terms involving pattern positions i and j can be readily computed as
soon as the rotamers $j_{g'}^r$ (induced by $i_r^a$) and $i_{g''}^r$ (induced by $j_s^b$) as well as the associated
template ($E_t$) and pair ($E_p$) energies are known, and can be considered as corrections
needed to derive double ECO energies from single ECO's. These terms can be grouped
into a new function $E_{\mathrm{fuse}}(i_r^a, j_s^b)$, comprising the said energetic corrections needed to fuse
two single ECO energies into one double ECO energy, defined in equation (43):

\[
\begin{aligned}
E_{\mathrm{fuse}}(i_r^a, j_s^b) ={}& - \bigl(E_t(j_{g'}^r)-E_t(j_g^r)\bigr) - \bigl(E_t(i_{g''}^r)-E_t(i_g^r)\bigr) \\
&+ \bigl(E_p(i_r^a,j_s^b)-E_p(i_g^r,j_g^r)\bigr) - \bigl(E_p(i_r^a,j_{g'}^r)-E_p(i_g^r,j_g^r)\bigr) - \bigl(E_p(i_{g''}^r,j_s^b)-E_p(i_g^r,j_g^r)\bigr)
\end{aligned}
\tag{eq. 43}
\]

Then, still under the AA hypothesis, equation (42) becomes:

\[ E_{\mathrm{ECO}}([i_r^a,j_s^b]) = E_{\mathrm{ECO}}(i_r^a) + E_{\mathrm{ECO}}(j_s^b) + E_{\mathrm{fuse}}(i_r^a, j_s^b) \tag{eq. 44} \]
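
As an illustration of how the fuse correction might be evaluated, the sketch below computes the pattern-only correction terms (template and pair energy differences for positions i and j) and adds them to two single ECO energies. It assumes the template and pair energies are available as callables `e_t` and `e_p`; these names, the argument names and the string rotamer labels used in any call are hypothetical.

```python
def fuse_energy(e_t, e_p, i_ref, j_ref, i_pat, j_pat, j_adapted, i_adapted):
    """Correction needed to fuse two single ECO energies (AA hypothesis assumed).

    i_ref, j_ref : reference rotamers at positions i and j
    i_pat, j_pat : pattern rotamers placed at positions i and j
    j_adapted    : rotamer adopted at j after introducing the pattern at i (g')
    i_adapted    : rotamer adopted at i after introducing the pattern at j (g'')
    """
    e_pair_ref = e_p(i_ref, j_ref)
    return (
        -(e_t(j_adapted) - e_t(j_ref))
        - (e_t(i_adapted) - e_t(i_ref))
        + (e_p(i_pat, j_pat) - e_pair_ref)
        - (e_p(i_pat, j_adapted) - e_pair_ref)
        - (e_p(i_adapted, j_pat) - e_pair_ref)
    )


def double_eco(eco_i, eco_j, fuse):
    """Double ECO energy as the sum of two single ECO energies and the fuse term."""
    return eco_i + eco_j + fuse
```

When neither pattern residue adapts to the other (j_adapted equals j_ref and i_adapted equals i_ref), the template terms vanish and only the four pair energies between the pattern and reference rotamers remain.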

When the AA criteria are not met, as illustrated in Fig. 3b, errors in equation (44)
due to ignoring non-additivity effects can be included in a non-additivity term
$E_{\text{non-AA}}(i_r^a, j_s^b)$, leading to a universally applicable expression:

\[ E_{\mathrm{ECO}}([i_r^a,j_s^b]) = E_{\mathrm{ECO}}(i_r^a) + E_{\mathrm{ECO}}(j_s^b) + E_{\mathrm{fuse}}(i_r^a, j_s^b) + E_{\text{non-AA}}(i_r^a, j_s^b) \tag{eq. 45} \]

The first three terms at the right hand side of equation (45), as well as the double
ECO energy, can be exactly calculated. Therefore, equation (45) provides a practical
means to statistically analyse possible errors due to neglecting non-additivity effects and
to correlate these errors with properties related to single residue patterns.

The present invention is also able to investigate in detail the relationship
between single and double GECO's. This is of special importance because GECO's,
being grouped over the rotameric dimension (r), depend only on a position (i) and
type (a) and, consequently, allow for a huge reduction of data while keeping only the
most relevant information. Since GECO's reflect how well residue types (not
necessarily rotamers) can fit at specific positions in a protein structure, analysis of the
relationship between single and double GECO's could make it possible to
answer questions like:

- If a double substitution is energetically compatible, is each single substitution
  $[i^a]$ and $[j^b]$ energetically compatible as well, and vice-versa?, and
- can double GECO's be easily derived from single GECO's?

The ultimate aim of such analysis is to predict the energetic compatibility of multiple
substitutions from information related to single and double GECO's, thereby facilitating
protein design and fold recognition applications by allowing detailed, structure-based
predictions without actually considering any 3D structure.

By analogy to equation (45), the following relationship between
single and double GECO energies is first assumed:

\[ E_{\mathrm{GECO}}([i^a, j^b]) = E_{\mathrm{GECO}}(i^a) + E_{\mathrm{GECO}}(j^b) + E_{\mathrm{fuse}}(i^a, j^b) + E_{\text{non-AA}}(i^a, j^b) \tag{eq. 46} \]

wherein $E_{\mathrm{GECO}}([i^a, j^b])$ is the energy of a GECO of a special pattern p defined on residue
positions i and/or j carrying types a and/or b, respectively, but where the rotameric
states are not necessarily defined. The non-additivity term $E_{\text{non-AA}}(i^a, j^b)$ has the same
meaning as previously, i.e. it reflects the energetic consequences of a possible
inconsistency between adaptations caused by patterns $i^a$ and $j^b$ versus $[i^a, j^b]$. Although
some ambiguity may lie in this term since the rotameric state of each of these patterns
may be undefined (except for GECO's derived by equations (10) and (15)), this
term can simply be ignored since the method of the present invention is preferably used
under the AA hypothesis. In equation (46), $E_{\mathrm{fuse}}(i^a, j^b)$ may be considered as a correction
term needed to take into account that $i^a$ has been considered in the reference structure
where j had type r (and mutatis mutandis for $j^b$). Since both types a and b are likely to
be different from the corresponding types in the reference structure, this fuse-term can
adopt a wide variety of values, both negative and positive. In contrast to the foregoing
description, it would be in conflict with the basic GECO concept to attempt to calculate
$E_{\mathrm{fuse}}(i^a, j^b)$ by similarity with equation (43) for the following reason. If the rotameric
states of the involved residues are unknown or irrelevant, then it becomes impossible or
irrelevant to consider the associated rotamer/template and rotamer/rotamer interaction
energies as in equation (43). A much more practical way to assign $E_{\mathrm{fuse}}(i^a, j^b)$ values is
to derive them directly from the single and double GECO energies according to
equation (47):

\[ E_{\mathrm{fuse}}(i^a, j^b) = E_{\mathrm{GECO}}([i^a, j^b]) - \bigl(E_{\mathrm{GECO}}(i^a) + E_{\mathrm{GECO}}(j^b)\bigr) \tag{eq. 47} \]

which is true when there is additivity in the adaptation effects caused by the pair of
single GECO's and the double GECO. However, when the AA hypothesis is not
fulfilled, the energetic consequences (actually, $E_{\text{non-AA}}(i^a, j^b)$) will be incorporated into
the fuse term $E_{\mathrm{fuse}}(i^a, j^b)$. In either case, equation (47) can be used to compute double GECO
energies from a database containing single GECO energies and pair-wise fuse terms
without effectively generating the double GECO.
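
A minimal sketch of this bookkeeping is given below, assuming single GECO energies are keyed by a (position, residue type) tuple and double GECO energies by a pair of such tuples. The dictionary layout and function names are illustrative assumptions; the description only states that such values are preferably stored in a database.

```python
def fuse_terms_from_gecos(single, double):
    """Derive pair-wise fuse terms from single and double GECO energies (eq. 47)."""
    return {
        (pat_i, pat_j): e_double - (single[pat_i] + single[pat_j])
        for (pat_i, pat_j), e_double in double.items()
    }


def double_geco_from_fuse(single, fuse, pat_i, pat_j):
    """Recover a double GECO energy from stored single energies and a fuse term."""
    return single[pat_i] + single[pat_j] + fuse[(pat_i, pat_j)]
```

Because the fuse term is defined as a difference, any non-additivity present in the modeled double GECO is absorbed into it, as noted in the text.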

Another aspect of the present invention is a method to compute the energy of
multiply substituted proteins from single and double GECO energies. Considering that:

- equation (45) means that the substitution energy of a multiplet pattern
  (such as a pair) can be derived from the substitution energy of basic patterns (i.e.
  single residues),

- the present invention relates to energy functions which are essentially pair-wise
  additive in the sense that the global energy of a protein structure can be written in a
  form equivalent to equation (4),

- the fuse term in equation (45) mainly contains energetic contributions which are
  directly and exclusively related to pattern residue positions, and

- the non-additivity effects are expected to be small in size compared to the fuse term in
  eq (45).

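Taking into account the previous elements and remarks, we postulate that the total
substitution energy of a protein structure comprising a multiplet pattern of the type
$[i^a, j^b, k^c, \ldots]$ can be approximated by the equation

\[
\begin{aligned}
E_{\mathrm{GECO}}([i^a, j^b, k^c, \ldots]) \approx{}& E_{\mathrm{GECO}}(i^a) + E_{\mathrm{GECO}}(j^b) + E_{\mathrm{GECO}}(k^c) + \ldots \\
&+ E_{\mathrm{fuse}}(i^a, j^b) + E_{\mathrm{fuse}}(i^a, k^c) + \ldots \\
&+ E_{\mathrm{fuse}}(j^b, k^c) + \ldots \\
&+ \ldots
\end{aligned}
\tag{eq. 48}
\]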

If the pattern residue positions are not explicitly given a specific name, then
equation (48) can be written in a simpler form:

\[ E_{\mathrm{GECO}}(p) \approx \sum_{i} E_{\mathrm{GECO}}(i^{a(i)}) + \sum_{i}\sum_{j>i} E_{\mathrm{fuse}}(i^{a(i)}, j^{a(j)}) \tag{eq. 49} \]

where p is a multiplet pattern defined on any number of residue positions of interest i
and a residue type of interest a(i) is placed at position i in an undefined rotameric state,
and where $E_{\mathrm{GECO}}(i^{a(i)})$ is the single residue GECO energy for $i^{a(i)}$, and where
$E_{\mathrm{fuse}}(i^{a(i)}, j^{a(j)})$ is the fuse term for two single residue patterns $i^{a(i)}$ and $j^{a(j)}$, preferably calculated
from the corresponding double GECO energy and both single GECO energies using
equation (47).

Clearly, equation (49) forms an essential aspect of the present invention since it
indicates that the energetic compatibility of any amino acid sequence of interest within
the context of an adaptive reference structure can be assessed from only single and
pair-wise energy terms (which are preferably stored in a database).
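
The sum of single GECO energies and pair-wise fuse terms over a multiplet pattern can be sketched as follows. The sketch assumes a pattern is given as a mapping from residue position to residue type, and that fuse terms are stored with the lower position first in the key; both conventions, and the function name, are assumptions made for illustration only.

```python
from itertools import combinations


def approximate_geco(pattern, single, fuse):
    """Approximate the GECO energy of a multiplet pattern from stored terms.

    pattern : {position: residue_type}
    single  : {(position, residue_type): single GECO energy}
    fuse    : {((pos_i, type_i), (pos_j, type_j)): fuse term}, with pos_i < pos_j
    """
    residues = sorted(pattern.items())  # ordered by position, so i < j in each pair
    total = sum(single[res] for res in residues)
    total += sum(fuse[(res_i, res_j)] for res_i, res_j in combinations(residues, 2))
    return total
```

For a pattern on n positions this touches n single terms and n(n-1)/2 fuse terms, so no structure modeling is needed at evaluation time.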

Potential errors in the calculation of $E_{\mathrm{GECO}}(p)$ by means of equation (49), compared
to when p would be effectively substituted and modeled into the reference structure,
may result from different sources. However, the quantitative global error in equation
(49) can be included in an extra error term for the multiplet pattern p, $E_{\mathrm{error}}(p)$, so that
we obtain

\[ E_{\mathrm{GECO}}(p) = \sum_{i} E_{\mathrm{GECO}}(i^{a(i)}) + \sum_{i}\sum_{j>i} E_{\mathrm{fuse}}(i^{a(i)}, j^{a(j)}) + E_{\mathrm{error}}(p) \tag{eq. 50} \]

The latter equation provides a practical means to quantitatively estimate the total
error for a variety of different patterns p. Indeed, if $E_{\mathrm{GECO}}(p)$ computed with equation
(49) is renamed as $E^{\mathrm{approx}}_{\mathrm{GECO}}(p)$ and $E_{\mathrm{GECO}}(p)$ resulting from the true modeling of p in the
reference structure is referred to as $E^{\mathrm{model}}_{\mathrm{GECO}}(p)$, then

\[ E_{\mathrm{error}}(p) = E^{\mathrm{model}}_{\mathrm{GECO}}(p) - E^{\mathrm{approx}}_{\mathrm{GECO}}(p) \tag{eq. 51} \]

In the section FOLD RECOGNITION APPLICATION, a correlation is made 
between the approximate and modeled GECO energies for a selected protein structure 
and a wide variety of different patterns substituted into the topologically most difficult 
part of the protein, i.e. the core. 

PROOF OF PRINCIPLE 

Hereinafter, the practical usefulness of the ECO concept with respect to two 
important fields of application is illustrated. In a first experiment, a fold recognition 
simulation is performed by searching a near-optimal correlation between a target amino 
acid sequence of interest and a transformed-GECO profile derived for a protein 
structure which is homologous to the target sequence. In a second experiment, a protein 
design experiment is carried out on three selected positions in the B1 domain of protein
G (PDB code 1PGA). More specifically, at three selected residue positions 100
randomly chosen triple mutations were modeled and the global energy of each modeled
structure was compared to the energy calculated on the basis of single and double
GECO energies, in accordance with equation (49). From a correlation plot between the
two sets of energy data, it follows that the global energy of a multiply substituted
protein structure may be estimated from only single and double GECO energies with
sufficient accuracy. Since the latter computations occur extremely fast compared to the
effective modeling of multiple substitutions, the method based on single and double
GECO's may be useful to rapidly generate a set of amino acid sequences which are
likely to be compatible with a protein structure.

FOLD RECOGNITION APPLICATION 

In order to exemplify the potential of the ECO concept with respect to fold 
recognition applications, single residue ECO's were generated for hen lysozyme (PDB 
code 3LZT) by applying the SPA embodiment of the present invention (see TABLE 1 
for experimental settings and data). Next, the single residue ECO's were grouped into 
single residue GECO's for each residue position/type combination by searching the 
minimal ECO energy value over all rotamers, in accordance with equation (10) (see 
Table 2). The resulting set of GECO's is hereinafter referred to as the initial GECO 
profile for the protein studied. The initial profile for 3LZT was stored to disk so that 
different versions of a recognition method (described below) could be tested using the 
same initial profile. 

It has been observed that five different consecutive transformations of GECO 
energies were helpful to obtain a near-optimal alignment between a target amino acid 
sequence and a GECO profile. A first transformation, performed on the initial GECO 
profile for 3LZT, involved the translation of the distribution of GECO energies for a 
given residue position by the lowest value observed for that position over all residue 
types. The transformation equation used was

\[ E_{\mathrm{GECO},1}(i^a) = E_{\mathrm{GECO}}(i^a) - \min_a E_{\mathrm{GECO}}(i^a) \tag{eq. 52} \]

where $E_{\mathrm{GECO},1}(i^a)$ is the transformed GECO energy for residue type a at position i,
$E_{\mathrm{GECO}}(i^a)$ is the initial GECO energy for $i^a$ and $\min_a$ is the min-operator to search for the
lowest possible $E_{\mathrm{GECO}}(i^a)$ value at position i. As a result of this operation, all
transformed GECO energies became equal to or greater than zero. A second
transformation included the truncation of all $E_{\mathrm{GECO},1}(i^a)$ values to a maximal value of
100, as formally expressed by equation (53):

\[ E_{\mathrm{GECO},2}(i^a) = \min\bigl(E_{\mathrm{GECO},1}(i^a),\, 100\bigr) \tag{eq. 53} \]

where min is the min-operator acting on its two arguments, i.e. $E_{\mathrm{GECO},1}(i^a)$ and a value
of 100. As a result, strongly restrained substitutions (see Table 2) received a default
value of 100. A third transformation was a logarithmic rescaling of $E_{\mathrm{GECO},2}(i^a)$ values by
applying the following equation:

\[ E_{\mathrm{GECO},3}(i^a) = \log\bigl(1 + E_{\mathrm{GECO},2}(i^a) \times 0.99\bigr) \tag{eq. 54} \]

whereby an $E_{\mathrm{GECO},2}(i^a)$ interval [0,100] was mapped onto an $E_{\mathrm{GECO},3}(i^a)$ interval [0,2] by a
decimal logarithm function. Next, these values were converted to normalized units as
follows: first, a fourth transformation was performed on the distributions of $E_{\mathrm{GECO},3}(i^a)$
values for each position i. These distributions were assumed to be Gaussian (which is,
admittedly, an approximation) and the average value, $\mathrm{aver}_a(E_{\mathrm{GECO},3}(i^a))$, and standard
deviation, $\mathrm{std}_a(E_{\mathrm{GECO},3}(i^a))$, were calculated for each position i over all residue types a.
Then, equation (55)

\[ E_{\mathrm{GECO},4}(i^a) = \bigl(E_{\mathrm{GECO},3}(i^a) - \mathrm{aver}_a(E_{\mathrm{GECO},3}(i^a))\bigr) \,/\, \mathrm{std}_a(E_{\mathrm{GECO},3}(i^a)) \tag{eq. 55} \]

was performed on all $E_{\mathrm{GECO},3}(i^a)$ values, which essentially rescales the transformed
GECO energies to a number of standard deviations above or below the average value
observed for a position i. Negative $E_{\mathrm{GECO},4}(i^a)$ values may be associated with
"favorable" or "better than average" substitutions whereas positive values may be seen
as "unfavorable" substitutions. Next, a fifth transformation was performed on the
$E_{\mathrm{GECO},4}(i^a)$ values for the 20 residue type distributions by the equation

\[ E_{\mathrm{GECO},5}(i^a) = \bigl(E_{\mathrm{GECO},4}(i^a) - \mathrm{aver}_i(E_{\mathrm{GECO},4}(i^a))\bigr) \,/\, \mathrm{std}_i(E_{\mathrm{GECO},4}(i^a)) \tag{eq. 56} \]

which rescales the $E_{\mathrm{GECO},4}(i^a)$ values to a number of standard deviations above or below
the average value observed for a residue type a. The reason for the fourth
transformation was to include a correction for the intrinsic level of difficulty to place
different residue types at a given position, while the fifth transformation intended to
correct for the intrinsic level of complexity to place a given residue type at different
positions. A set of $E_{\mathrm{GECO},5}(i^a)$ values for all residue positions i of a given protein
structure and the 20 naturally occurring residue types a at each position is hereinafter
called a transformed GECO profile for the protein involved.
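
The five consecutive transformations can be sketched in a few lines. The sketch below assumes the initial profile is a list of rows, one per residue position, each holding one GECO energy per residue type; it uses the population standard deviation, which is an assumption since the text does not say which estimator was used, and it would fail with a division by zero if any row or column were constant.

```python
import math
from statistics import mean, pstdev


def transform_profile(profile, cap=100.0):
    """Apply transformations 1 to 5 to profile[position][residue_type]."""
    # 1: shift each position so its lowest energy becomes zero
    t1 = [[e - min(row) for e in row] for row in profile]
    # 2: truncate at the maximal value
    t2 = [[min(e, cap) for e in row] for row in t1]
    # 3: decimal-log rescaling, mapping [0, 100] onto [0, 2]
    t3 = [[math.log10(1.0 + e * 0.99) for e in row] for row in t2]
    # 4: normalize per position, over all residue types
    t4 = []
    for row in t3:
        avg, std = mean(row), pstdev(row)
        t4.append([(e - avg) / std for e in row])
    # 5: normalize per residue type, over all positions
    n_types = len(profile[0])
    columns = [[row[a] for row in t4] for a in range(n_types)]
    col_avg = [mean(col) for col in columns]
    col_std = [pstdev(col) for col in columns]
    return [
        [(row[a] - col_avg[a]) / col_std[a] for a in range(n_types)]
        for row in t4
    ]
```

After the last step every residue-type column has zero mean and unit standard deviation, which is the property the subsequent alignment statistics rely on.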

Target amino acid sequences were compared with transformed GECO profiles
using a modified Smith-Waterman (SW) alignment procedure in accordance with R.
Durbin et al. in Biological Sequence Analysis. Probabilistic models of proteins and
nucleic acids (1998) at Cambridge University Press. The modification consisted in that
no gap formation was allowed, which is equivalent to assigning an infinite gap penalty
value. Since the SW approach requires that (i) a favorable alignment of a residue against
a given position in a profile receives a positive score value (whereas favorable
$E_{\mathrm{GECO},5}(i^a)$ values are negative) and (ii) the global expected value of score
contributions should be negative (whereas the global expected value for $E_{\mathrm{GECO},5}(i^a)$
values is zero as a result of the normalization operations), two additional
transformations of the $E_{\mathrm{GECO},5}(i^a)$ values were needed. First, the sign of the $E_{\mathrm{GECO},5}(i^a)$
values was inverted to meet the first requirement:

\[ E_{\mathrm{GECO},6}(i^a) = -E_{\mathrm{GECO},5}(i^a) \tag{eq. 57} \]

In addition, during construction of a SW alignment matrix, a value of 0.5 was
subtracted from the $E_{\mathrm{GECO},6}(i^a)$ values to meet the second requirement. This means that a
residue only contributes in a favorable way to an alignment if its $E_{\mathrm{GECO},6}(i^a)$ value is
greater than 0.5 (i.e. if the log-transformed $E_{\mathrm{GECO},3}(i^a)$ value is at least 0.5 standard
deviations below the average value at position i). However, since this transformation
was only intended to identify high-scoring sequence fragments on the basis of maxima
in the alignment matrix, in accordance with the SW method, the non-transformed
$E_{\mathrm{GECO},6}(i^a)$ values were used in all further calculations. From the SW alignment matrix,
the 50 highest-scoring fragments were selected, being characterized by (i) an offset in
the target sequence, (ii) an offset in the transformed GECO profile, (iii) a length l and
(iv) a cumulative alignment score z being the sum of $E_{\mathrm{GECO},6}(i^a)$ contributions for
positions i in the profile and residue types a observed in the target sequence.
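
A gapless local alignment of this kind reduces to finding the best-scoring run along any diagonal of the score matrix. The sketch below is one possible way to do this and returns only the single best fragment rather than the 50 highest-scoring ones; it assumes the target is a list of residue-type indices and the profile a list of rows of sign-inverted scores, and all names are hypothetical.

```python
def best_gapless_fragment(target, profile, shift=0.5):
    """Return (target_offset, profile_offset, length, z) of the best fragment.

    target  : residue-type index for each target position
    profile : profile[i][a], sign-inverted transformed GECO score
    shift   : value subtracted while filling the matrix (0.5 in the text)
    """
    best = (0.0, 0, 0, 0)  # (matrix score, target offset, profile offset, length)
    n_profile = len(profile)
    prev = [(0.0, 0)] * (n_profile + 1)  # (running score, run length) per column
    for t in range(1, len(target) + 1):
        cur = [(0.0, 0)] * (n_profile + 1)
        for p in range(1, n_profile + 1):
            score, length = prev[p - 1]  # only the diagonal predecessor: no gaps
            score += profile[p - 1][target[t - 1]] - shift
            if score > 0.0:
                cur[p] = (score, length + 1)
                if score > best[0]:
                    best = (score, t - length - 1, p - length - 1, length + 1)
        prev = cur
    _, t_off, p_off, length = best
    # The cumulative score z is summed from the unshifted values.
    z = sum(profile[p_off + k][target[t_off + k]] for k in range(length))
    return t_off, p_off, length, z
```

Resetting a cell to zero whenever the running score is not positive is what makes the alignment local; summing the unshifted values afterwards follows the statement that the shifted scores only served to locate fragments.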

The obtained fragments were further statistically analyzed by calculating the
probability that they were found by pure chance. The probability P(l,z) that a randomly
aligned fragment of a given length l and consisting of uncorrelated, randomly selected
residue types has an alignment score equal to or better than a given score z can be
derived as follows. If two variables $x_1$ and $x_2$ are independent normal variables with an
average value of 0 and a standard deviation of 1, here written as N(0,1), then their sum
is distributed as $N(0,\sqrt{2})$, i.e. a distribution centered around 0, with a standard
deviation equal to $\sqrt{2}$, according to Neuts in Probability (1973) at Allyn and Bacon,
Inc. (Boston). By extension, the sum of n such variables is distributed as $N(0,\sqrt{n})$.
Then, the probability P(n,z) that the sum of n such variables is equal to or greater than a
value z, is given by the integration of the probability density function $N(0,\sqrt{n})$:

\[ P(n,z) = \int_{z}^{\infty} \frac{1}{\sqrt{2\pi n}}\, e^{-x^2/(2n)}\, dx \tag{eq. 58} \]

When applied to aligned fragments of length l having a cumulative score z and
for which the constituent residues have a normal distribution N(0,1), the probability that
the said value z or any greater value is obtained by pure chance is given by equation
(59):

\[ P(l,z) = \int_{z}^{\infty} \frac{1}{\sqrt{2\pi l}}\, e^{-x^2/(2l)}\, dx \tag{eq. 59} \]

The latter equation made it possible to assign a P-value to all high-scoring fragments, which
P-value corresponds to the probability that one random positioning of a fragment of
length l results in a score of at least z, which is different from the probability to obtain at
least one fragment of length l with a score of at least z (in the latter case, the size of the
target sequence and the profile need to be considered). The assignment of a P-value to
each of the 50 highest-scoring fragments made it possible to rank all fragments according to
their significance. Finally, all fragments having a P-value below $10^{-4}$ were retained as
potentially significant.
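
The upper-tail integral of a zero-mean normal density with standard deviation equal to the square root of the fragment length has a closed form in terms of the complementary error function, P(l,z) = erfc(z / sqrt(2 l)) / 2. A minimal sketch using only the Python standard library follows; the function name and the default threshold argument are illustrative.

```python
import math


def fragment_p_value(length, z):
    """Probability that a sum of `length` N(0,1) scores is at least z."""
    return 0.5 * math.erfc(z / math.sqrt(2.0 * length))


def is_potentially_significant(length, z, threshold=1e-4):
    """Retain a fragment when its P-value falls below the threshold."""
    return fragment_p_value(length, z) < threshold
```

Using `erfc` directly, rather than `1 - erf`, avoids loss of precision for the very small P-values that matter here.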

The final global alignment was obtained by first clearing, in the aforementioned
SW alignment matrix, all values for matrix elements not forming part of any retained
fragment and, secondly, performing a Needleman-Wunsch (NW) protocol to find the
best possible concatenation of the retained fragments (R. Durbin et al., cited supra). The
NW protocol was performed using a gap opening penalty of 2.0 and a gap extension
penalty of 0.2, resulting in a unique global alignment. The assessment of the
significance of a global alignment was performed by summing the cumulative score for
all fragments forming part of the global alignment, diminished by 2 x (the number of
fragments - 1), further diminished by 0.2 x the total number of non-matched residue
positions between matched fragments, inputting this value into equation (59) in order to
obtain a global P-value and qualifying the result as a "positively recognized fold" if the
global P-value is below $10^{-15}$. Alternatively, a fold was also considered as "positively
recognized" if the best fragment had a P-value below $10^{-10}$. This alignment procedure
may not be optimal, but it forms a feasible and practically useful method to illustrate
the potential of the ECO concept with respect to fold recognition.
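
The penalized global score and the two acceptance rules described above can be written down directly. In the sketch below the function and parameter names are hypothetical, and the conversion of the global score into a global P-value is left out because the text does not state which length enters equation (59) for a concatenated alignment.

```python
def global_alignment_score(fragment_scores, unmatched_positions,
                           gap_open=2.0, gap_extend=0.2):
    """Sum of fragment scores minus gap penalties between concatenated fragments."""
    n_gaps = max(len(fragment_scores) - 1, 0)
    return (sum(fragment_scores)
            - gap_open * n_gaps
            - gap_extend * unmatched_positions)


def is_positively_recognized(global_p, best_fragment_p,
                             global_cutoff=1e-15, fragment_cutoff=1e-10):
    """A fold counts as recognized if either P-value criterion is met."""
    return global_p < global_cutoff or best_fragment_p < fragment_cutoff
```

For example, three fragments with scores 10, 8 and 6 separated by 15 non-matched positions in total give 24 - 4 - 3 = 17.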

The aforementioned procedure for ECO-based analysis of the structural
compatibility of a target amino acid sequence with a protein 3D structure was applied to
a homologous pair of proteins from the C-type lysozyme family. Concretely, the
sequence of human alpha-lactalbumin was aligned with the transformed GECO
profile of hen lysozyme (PDB code 3LZT). The sequence identity between both
proteins is 40% for the best matching fragment of 113 residues (out of 123 for
lactalbumin and 129 for lysozyme), according to a BLAST alignment performed at the
internet address http://www.ncbi.nlm.nih.gov/. As a result of the alignment procedure,
10 aligned fragments were retained having a P-value below $10^{-4}$ (FIG. 4). The best
fragment had a P-value of $8.9 \times 10^{-15}$, which already qualified the lactalbumin target
sequence as being structurally compatible with the 3LZT fold. Three other fragments
(P-values in ascending order were 1.5 x 10'", $6.3 \times 10^{-7}$ and $1.9 \times 10^{-5}$) also formed part
of the final global alignment. Since the 3D-structure of human alpha-lactalbumin is
known (PDB code 1B9O), a structural comparison with 3LZT was possible. It was
found that the NW concatenation procedure succeeded in selecting only those fragments
which are indeed similar in structure. Non-matched regions (between the fragments)
were observed typically in loop regions where one or more deletions occurred with
respect to lysozyme. On the basis of this comparison, the alignment can be clearly
qualified as truly positive. The six fragments not forming part of the final alignment
were dissimilar in structure and were correctly excluded by the NW concatenation.

In order to investigate the discriminative power of the method with respect to
non-homologous sequences, the lactalbumin sequence was randomly permuted 1000
times and each sequence was aligned against the 3LZT profile. From each alignment,
the best scoring fragment with the lowest P-value was retained. Out of these 1000
experiments, the best random fragment had a P-value of $1.4 \times 10^{-6}$ which agrees well
with the false positive fragments generally appearing in each alignment experiment
(FIG. 4). It also shows that the false positive fragment in the lactalbumin/lysozyme
alignment having an exceptional P-value of $7.3 \times 10^{-9}$ must be due to intra-fragment
correlation effects, i.e. the high-scoring lactalbumin fragment cannot be equivalent to a
random fragment. Since no structural similarity between the target and profile fragments
could be observed, the nature of the presumed intra-fragment correlation is as yet not
understood. Moreover, portions of this exceptional fragment also appeared in three
other false positive fragments (i.e. fragments 5, 6 and 9 in FIG. 4), although their
P-value was typical for random fragments, i.e. around 1 0 .

In conclusion, these experiments show that the ECO concept is useful to identify a
structural relationship between two protein molecules by aligning a target amino acid
sequence against a GECO profile. Moreover, a global structure-based
alignment may be obtained. These results are obtained by only using single ECO's,
more specifically the substitution energies of single residues in the context of a
reference structure which may be up to 60% different in amino acid sequence compared
to a target sequence. When suitably including information derived from double ECO's,
the quality of the alignments is expected to increase, since double ECO's contain
information related to pair-wise residue/residue interactions.



PROTEIN DESIGN APPLICATION 

The B1 domain of protein G (PDB code 1PGA) was chosen to simulate 100
different randomly selected triple-residue substitutions and to compare the global
energy obtained by the effective modeling of each triple-substitution with the global
energy estimated on the basis of single and double GECO energies. This was not
intended to predict the most stable variant but rather to show that protein design
applications based on single and double (G)ECO's form a valuable alternative
compared to previously known methods directly operating on a 3D structure, thereby
being confronted with the combinatorial substitution problem.

First the GMEC of the 1PGA structure was computed using the FASTER method
(such as previously described) and the same rotamer library and energy function as in
all other calculations (see Table 1). All residues having a rotatable side chain were
included in the GMEC computation. The global energy of this structure, which was
taken as the reference structure in all experiments related to 1PGA, was -149.3 kcal mol$^{-1}$.

Three proximate residue positions being located in the core of 1PGA were selected
to perform further design experiments. More specifically, residues 5 (LEU), 30 (PHE)
and 52 (PHE), were chosen to be mutated into 100 different random combinations of
residue types. Each combination is hereinafter called a mutated sequence and
represented as {X,Y,Z} where X, Y and Z refer to the residue types placed at positions
5, 30 and 52, respectively. The WT "mutation" {LEU, PHE, PHE} was added to this set
as an extra test. Each mutated sequence was modeled into 1PGA exactly in the same
way as the WT sequence, after having performed the necessary side-chain replacements.
These modeling experiments were executed by a pure structure prediction method (i.e.
the FASTER method) and not by an embodiment of the present invention: the
conformation of the mutated residues was not kept fixed (in contrast to patterns) but
was co-modeled with that of all other rotatable side chains. Analysis of the results
showed that the global energies of the mutated sequences ranged from -149.3 to 22274
kcal mol$^{-1}$. However, all positive values were associated with sequences containing at
least one PRO residue which is prohibited in a β-sheet (positions 5 and 52) and in an
α-helix (position 30). The highest energy for sequences not containing PRO was -85.5
kcal mol$^{-1}$ and was associated with the sequence {TRP,GLN,VAL}. The sequences with
a negative energy clustered around -121.6 ± 9.7 kcal mol$^{-1}$ (standard deviation). No
sequence had an energy better than the WT sequence, not surprisingly since only 100
out of the $20^3$ possible sequences were examined and since the WT structure is
well-packed.

In order to exemplify the use of the ECO concept with respect to protein design, it
was investigated how the energies from the modeled sequences could be approximated
by using only single and double GECO energies. For this purpose, all possible single
and double ECO's for the three selected positions were generated in agreement with the
rotameric SPA embodiment of the present invention. The same experimental settings
were applied as in the previous fold recognition experiments. ECO's were grouped into
GECO's in accordance with equations (5) and (10) but no transformation was
performed on the resulting GECO energies.

From the sets of calculated single residue GECO energies, $E_{GECO}(i^a)$, and double
residue GECO energies, $E_{GECO}([i^a, j^b])$, an equivalent set of fuse terms,
$E_{fuse}(i^a, j^b)$, was deduced by applying equation (47). The values for all fuse
terms related to positions 5 and 30 are listed in Table 4. Most of the fuse terms are
negative, which means that effectively modeled double mutations are generally better
than the mere sum of each single substitution energy. Roughly speaking, negative values
are expected to arise from better interactions in a double substitution compared to each
single substitution being modeled into the reference structure, although a double
substitution may also be energetically more constrained than each of the single
mutations involved. In addition, some effects of non-additivity of adaptation may play a
role as well. Yet, the fuse terms automatically correct for all these effects so that,
together with the single GECO energies, they can be used "backwards" to reconstruct a
double mutation without actually needing to model it.
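
The relation between single GECO energies, double GECO energies and fuse terms can be sketched in a few lines of Python. Equation (47) is not reproduced in this section, so the sketch assumes that it defines a fuse term as the double GECO energy minus the sum of the two single GECO energies, which is consistent with the interpretation of negative fuse terms given above. The function name, the dictionary layout and all energy values are hypothetical and serve as illustration only.

```python
def fuse_terms(single, double):
    """Return a fuse term for every pair present in `double`.

    single: {(position, residue_type): single GECO energy}
    double: {((pos_i, type_i), (pos_j, type_j)): double GECO energy}
    Assumed relation (not taken from equation (47) itself):
        E_fuse = E_double - (E_single_i + E_single_j)
    """
    return {
        (a, b): e_pair - (single[a] + single[b])
        for (a, b), e_pair in double.items()
    }


# Hypothetical energies in kcal/mol, relative to the reference structure.
single = {(5, "ILE"): 3.2, (30, "TYR"): 4.1}
double = {((5, "ILE"), (30, "TYR")): 6.0}

fuse = fuse_terms(single, double)
pair = ((5, "ILE"), (30, "TYR"))

# Used "backwards": the singles plus the fuse term give back the double energy.
reconstructed = single[pair[0]] + single[pair[1]] + fuse[pair]
```

In this made-up case the fuse term is negative, i.e. the modeled double substitution is better than the sum of the two single substitutions.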

The question of how equation (49) performs in the computation of the global
energy of multiply substituted proteins was then investigated as follows. The energy of
each of the 101 triplet mutations (100 random sequences plus the WT sequence) was
calculated using equation (49) and these energies were compared to the energies
resulting from the effective modeling in a correlation diagram (FIG. 5). It was observed
that both data sets correlate surprisingly well, given the numerous possible sources of
approximation. Out of the 101 sequences, 55 had a difference between modeled and
calculated energy of less than 1 kcal mol⁻¹ in absolute value and 25 more had a
difference of less than 5 kcal mol⁻¹. The present invention includes differences of less
than 10 kcal mol⁻¹. Moreover, in the most interesting range, i.e. for low-energy
sequences, the correlation was even better. In the range up to 20 kcal mol⁻¹ above the
energy of the WT sequence all 15 sequences differed in energy by less than 1 kcal mol⁻¹
in absolute value. Some larger differences were observed as well, but they were found
exclusively in regions of higher energy. Even for the most constrained sequences
containing at least one PRO (15 sequences; results not shown in FIG. 5), 14 had a
difference in energy below 10 kcal mol⁻¹, while the energy of these modeled sequences
was always higher than 400 kcal mol⁻¹. While most energy values calculated from single
GECO energies and fuse terms were roughly evenly distributed above and below the
values from the modeled sequences, the largest differences were always positive, which
may be due to non-additivity in the adaptations upon calculation of the double residue
ECO's (leading to an overestimation of the fuse terms). For comparison, the energy of
each sequence calculated as the sum of each involved single GECO energy (ignoring the
fuse terms) was plotted in FIG. 5 as well. Even here some correlation with the
effectively modeled energies can be observed but the percentage difference is
significantly higher, which illustrates that the fuse terms contribute to the global
energy in a most beneficial way.
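
The comparison described above, i.e. an estimate with fuse terms versus an estimate from single GECO energies alone, can be sketched as follows. Equation (49) is not reproduced in this section; the sketch assumes that it adds the single GECO energies of all substitutions and the fuse terms of all pairs of substitutions to the energy of the reference structure. Only the reference energy of -149.3 kcal mol⁻¹ is taken from the text; the function name, the data layout and all other values are hypothetical.

```python
from itertools import combinations


def estimate_global_energy(sequence, e_ref, single, fuse, use_fuse=True):
    """Estimate the global energy of a multiply substituted protein.

    sequence: {position: residue_type} for the substituted positions
    single:   {(position, residue_type): single GECO energy}
    fuse:     {((pos_i, type_i), (pos_j, type_j)): fuse term}, pos_i < pos_j
    Assumed form: E = E_ref + sum of singles + sum of pairwise fuse terms.
    """
    substitutions = sorted(sequence.items())  # ordered by position
    energy = e_ref + sum(single[s] for s in substitutions)
    if use_fuse:
        energy += sum(fuse[pair] for pair in combinations(substitutions, 2))
    return energy


E_REF = -149.3  # kcal/mol, reference energy quoted in the text

# Hypothetical single GECO energies and fuse terms (kcal/mol).
single = {(5, "ILE"): 3.0, (30, "TYR"): 4.0, (52, "TRP"): 10.0}
fuse = {
    ((5, "ILE"), (30, "TYR")): -1.0,
    ((5, "ILE"), (52, "TRP")): -2.0,
    ((30, "TYR"), (52, "TRP")): 0.5,
}

mutant = {5: "ILE", 30: "TYR", 52: "TRP"}
with_fuse = estimate_global_energy(mutant, E_REF, single, fuse)
singles_only = estimate_global_energy(mutant, E_REF, single, fuse, use_fuse=False)
```

The difference between `with_fuse` and `singles_only` is exactly the sum of the three pairwise fuse terms, which is the correction that the singles-only curve in FIG. 5 lacks.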
In conclusion, the estimation of the global energy of multiply substituted proteins
may be performed to sufficient accuracy by using information derived from only single
and double residue ECO's, at least in the case studied, i.e. the core of the B1 domain of
protein G. It is well-known in the art that core substitutions are among the most difficult
ones since delicate packing effects are involved. Therefore, the present invention allows
the design of new sequences, irrespective of the region where mutations are proposed.
In addition, it is worth noting that the present embodiment to derive global energies
from single and double residue GECO's is extremely fast when the latter values have
been pre-calculated and can be retrieved from a file or database. Therefore, the present
method may be useful to rapidly search a reduced set of structure-compatible
sequences out of a combinatorial number of possibilities. Sequences obtained this way
may then be submitted to more detailed analysis.
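
A rapid search of this kind can be sketched as an exhaustive ranking over all residue type combinations at the selected positions, keeping only the lowest-scoring candidates for more detailed analysis. The sketch below is an illustration under stated assumptions: the scoring function stands in for any fast estimate built from pre-calculated values, and the function names, the reduced residue type list and the toy energies are hypothetical.

```python
import heapq
from itertools import product


def rank_sequences(positions, residue_types, score, top_n):
    """Return the `top_n` lowest-scoring residue type combinations.

    positions:     residue positions to substitute, e.g. [5, 30, 52]
    residue_types: residue types allowed at every position
    score:         callable taking {position: residue_type}, returning an energy
    """
    candidates = product(residue_types, repeat=len(positions))
    return heapq.nsmallest(
        top_n, candidates, key=lambda types: score(dict(zip(positions, types)))
    )


# Toy scoring function: a hypothetical per-type energy, summed over positions.
TOY_ENERGY = {"ALA": 2.0, "LEU": 0.0, "PHE": 0.5, "TRP": 6.0}


def toy_score(sequence):
    return sum(TOY_ENERGY[residue] for residue in sequence.values())


best = rank_sequences([5, 30, 52], list(TOY_ENERGY), toy_score, top_n=3)
```

With all 20 residue types at three positions the same loop visits 20³ = 8000 combinations, each costing only a few table lookups when the single and double values are pre-calculated.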

The present invention may be implemented on a computing device, e.g. a
personal computer or a work station, which has an input device for loading the details of
the reference structure and the reference amino acid sequence defined in step A of
claim 1, and also any further amino acid sequences. The computing device is adapted to
run software which carries out any of the methods in accordance with the present
invention. The computer may be a server which is connected to a data communications
transmission means such as the Internet. A script file including the details of the
reference structure and the amino acid sequence may be sent from one near location,
e.g. terminal, to a remote, i.e. second location, at which the server resides. The server
receives this data and outputs back along the communications line useful data to the
near terminal, e.g. the alignment of a sequence relative to the reference structure, or a
set of favourable amino acid sequences for one or more residues in the reference
structure, a protein where its sequence or a part thereof is determined by one or more of
the claimed methods, a database or part of a database including sets of ECO's for one or
more proteins, or an ECO in the form of a data structure.

Fig. 6 is a schematic representation of a computing system which can be utilized
in accordance with the method and system of the present invention. The system and
method provided by the present invention can be implemented with the computing
system depicted in Fig. 6. A computer 10 is depicted which may include a video display
terminal 14, a data input means such as a keyboard 16, and a graphic user interface
indicating means such as a mouse 18. Computer 10 may be implemented as a general
purpose computer.

Computer 10 includes a Central Processing Unit ("CPU") 15, such as a
conventional microprocessor of which a Pentium III processor supplied by Intel Corp.
USA is only an example, and a number of other units interconnected via system bus 22.
The computer 10 includes at least one memory. Memory may include any of a variety of
data storage devices known to the skilled person such as random-access memory
("RAM"), read-only memory ("ROM"), or non-volatile read/write memory such as a hard
disc. For example, computer 10 may further include
random-access memory ("RAM") 24, read-only memory ("ROM") 26, as well as an
optional display adapter 27 for connecting system bus 22 to an optional video display
terminal 14, and an optional input/output (I/O) adapter 29 for connecting peripheral
devices (e.g., disk and tape drives 23) to system bus 22. Video display terminal 14 can
be the visual output of computer 10, which can be any suitable display device such as a
CRT-based video display well-known in the art of computer hardware. However, with a
portable or notebook-based computer, video display terminal 14 can be replaced with an
LCD-based or a gas plasma-based flat-panel display. Computer 10 further includes user
interface adapter 30 for connecting a keyboard 16, mouse 18, optional speaker 36, as
well as allowing optional physical value inputs from physical value capture devices 40
of an external system 20. The devices 40 may be any suitable equipment for capturing
physical parameters required in the execution of the present invention. Additional or
alternative devices 41 for capturing physical parameters of an additional or alternative
external system 21 may also be connected to bus 22 via a communication adapter 39
connecting computer 10 to a data network such as the Internet, an Intranet, a Local or
Wide Area network (LAN or WAN) or a CAN. The term "physical value capture
device" includes devices which provide values of parameters of a system, e.g. they may
be an information network or databases such as a rotamer library.

Computer 10 also includes a graphical user interface that resides within a
machine-readable medium to direct the operation of computer 10. Any suitable
machine-readable media may retain the graphical user interface, such as a random access
memory (RAM) 24, a read-only memory (ROM) 26, a magnetic diskette, magnetic tape,
or optical disk (the last three being located in disk and tape drives 23). Any suitable
operating system and associated graphical user interface (e.g., Microsoft Windows) may
direct CPU 15. In addition, computer 10 includes a control program 51 which resides
within computer memory storage 52. Control program 51 contains instructions that,
when executed on CPU 15, carry out the operations described with respect to the
methods of the present invention as described above.

Those skilled in the art will appreciate that the hardware represented in FIG. 6
may vary for specific applications. For example, other peripheral devices such as optical
disk media, audio adapters, or chip programming devices, such as PAL or EPROM
programming devices well-known in the art of computer hardware, and the like may be
utilized in addition to or in place of the hardware already described.

In the example depicted in Fig. 6, the computer program product (i.e. control
program 51 for executing methods in accordance with the present invention,
comprising instruction means in accordance with the present invention) can reside in
computer storage 52. However, while the present invention has been, and will continue
to be, described in this context, those skilled in the art will appreciate that the methods
of the present invention are capable of being distributed as a program product in a
variety of forms, and that the present invention applies equally regardless of the
particular type of signal bearing media used to actually carry out the distribution.
Examples of computer readable signal bearing media include: recordable type media such
as floppy disks and CD ROMs and transmission type media such as digital and analogue
communication links.

TABLE 1 shows a compilation of single residue ECO's generated by the present
method. Each line represents an ECO. The columns are labeled according to the list
described in the section FORMAL DESCRIPTION OF ECO'S. Only basic ECO
properties are shown. 3LZT is the PDB code for hen lysozyme; SD200: 200 steps
steepest descent energy minimization; all H: CHARMM force field of Brooks et al.
(cited supra), including explicit hydrogen atoms and TTS contributions as listed in
Table 3; Flex: all "flexible" residues having a rotatable side chain; WT:res: WT residue
"res" as observed in 3LZT; 3LZT_G: global minimum energy conformation for 3LZT.
The value "1" in column III refers to the rotamer library of De Maeyer et al. (cited
supra).

CLAIMS

1. A method for analyzing a protein structure, the method being executable in a
computer under the control of a program stored in the computer, and comprising the
following steps:

A. receiving a reference structure for a protein, whereby

- the said reference structure forms a representation of a 3D structure of the said
protein, and

- the protein consists of a plurality of residue positions, each carrying a particular
reference amino acid type in a specific reference conformation, and

- the protein residues are classified into a set of modeled residue positions and a set
of fixed residues, the latter being included into a fixed template;

B. substituting into the reference structure of step (A) a pattern, whereby

- the said pattern consists of one or more of the modeled residue positions defined
in step (A), each carrying a particular amino acid residue type in a fixed
conformation, and

- the one or more amino acid residue types of the said pattern are replacing the
corresponding amino acid residue types present in the reference structure;

C. optimizing the global conformation of the reference structure of step (A) being
substituted by the pattern of step (B), whereby

- a suitable protein structure optimization method based on a function allowing to
assess the quality of a global protein structure, or any part thereof, is used in
combination with a suitable conformational search method, and

- the said structure optimization method is applied to all modeled residue positions
defined in step (A) not being located at any of the pattern residue positions
defined in step (B), and

- the pattern and template residues are kept fixed;

D. assessing the energetic compatibility of the pattern defined in step (B) within the
context of the reference structure defined in step (A) being structurally optimized
in step (C) with respect to the said pattern, by way of comparing the global
energy of the substituted and optimized protein structure with the global energy
of the non-substituted reference structure; and

E. storing a value reflecting the said energetic compatibility of the pattern
together with information related to the structure of the pattern in the form of an
energetic compatibility object.

2. A method according to claim 1, wherein the reference structure of step (A) further
represents the 3D structure of ligands including co-factors, ions or water molecules.

3. A method according to claim 1 or claim 2, wherein the reference structure of step (A) 
is received from a database of protein structures. 

4. A method according to any of claims 1 to 3, wherein the reference structure of step
(A) has undergone a further energy optimization step by means of a molecular
mechanics gradient method.

5. A method according to any of claims 1 to 4, wherein the reference structure of step
(A) assumes a reference amino acid sequence being the WT sequence of the said
protein.

6. A method according to any of claims 1 to 4, wherein the reference structure of step
(A) assumes a reference amino acid sequence being the sequence of lowest energy for
the said protein.

7. A method according to any of claims 1 to 4, wherein the reference structure of step
(A) assumes a reference amino acid sequence being a polyglycine or a polyalanine
sequence.

8. A method according to any of claims 1 to 4, wherein the reference structure of step 
(A) assumes any possible reference amino acid sequence including a random or a 
dummy sequence. 

9. A method according to any of claims 1 to 8, wherein the reference structure of step
(A) has been modeled by the structure optimization method used in step (C).

10. A method according to any of claims 1 to 9, wherein the set of modeled residue
positions comprises at least one position.

11. A method according to any of claims 1 to 9, wherein the set of modeled residue
positions comprises at least three positions.

12. A method according to any of claims 1 to 9, wherein the set of modeled residue 
positions comprises at least five positions. 

13. A method according to any of claims 1 to 12, wherein the pattern of step (B)
consists of a single residue position, the said pattern being referred to as a single residue
pattern.

14. A method according to any of claims 1 to 12, wherein the pattern of step (B)
consists of a pair of residue positions, the said pattern being referred to as a double
residue pattern.

15. A method according to any of claims 1 to 12, wherein the pattern of step (B)
consists of a series of consecutive, interconnected residue positions, the said pattern
being referred to as a loop pattern.

16. A method according to any of claims 1 to 15, wherein each of the pattern residue 
positions defined in step (B) are assigned a naturally occurring amino acid residue type. 

17. A method according to any of claims 1 to 15, wherein each of the pattern residue
positions defined in step (B) are assigned a naturally or a non-naturally occurring amino
acid residue type.

18. A method according to any of claims 1 to 17, wherein each of the pattern residue
positions carrying a particular amino acid residue type as defined in step (B) are
assigned a fixed conformation retrieved from a protein structure database.

19. A method according to any of claims 1 to 17, wherein each of the pattern residue
positions carrying a particular amino acid residue type as defined in step (B) are
assigned a fixed conformation which is computer-generated.

20. A method according to any of claims 1 to 17, wherein each of the pattern residue
positions carrying a particular amino acid residue type as defined in step (B) are
assigned a fixed conformation which is received from a rotamer library, the said pattern
being referred to as a rotameric pattern.

21. A method according to claim 20, wherein the said rotamer library is
backbone-independent, the said library being referred to as a side-chain rotamer library.



22. A method according to claim 20, wherein the said rotamer library is
backbone-dependent, the said library being referred to as a combined main-chain/side-chain
rotamer library, but wherein the main-chain information in the library is only used in
order to assign a side-chain conformation to each of the pattern residue positions
carrying a particular amino acid residue type as defined in step (B) which is structurally
compatible with the local main-chain conformation of each corresponding residue
position in the reference structure of step (A).

23. A method according to claim 20, wherein the said rotamer library is
backbone-dependent, the said library being referred to as a combined main-chain/side-chain
rotamer library, and wherein each of the pattern residue positions carrying a particular
amino acid residue type as defined in step (B) are assigned a fixed conformation in
accordance with a full main-chain/side-chain rotamer known in the said rotamer library
for the corresponding amino acid residue type located at each pattern residue position.

24. A method according to any of claims 1 to 23, wherein the said cost function of step
(C) is an energy function allowing to assess the potential or free energy of a global
protein structure or any part thereof.

25. A method according to any of claims 1 to 24, wherein the said conformational
search method of step (C) includes a DEE method.

26. A method according to any of claims 1 to 24, wherein the said conformational
search method of step (C) includes a method comprising the steps of:

(a) receiving a 3D representation of the molecular structure of a protein, the said
representation comprising a first set of residue portions and a template;

(b) modifying the representation of step (a) by at least one optimization cycle;
characterized in that each optimization cycle comprises the steps of:

(b1) perturbing a first representation of the molecular structure by modifying
the structure of one or more of the first set of residue portions;

(b2) relaxing the perturbed representation by optimizing the structure of one
or more of the non-perturbed residue portions of the first set with respect
to the one or more perturbed residue portions;

(b3) evaluating the perturbed and relaxed representation of the
molecular structure by using an energetic cost function and
replacing the first representation by the perturbed and relaxed
representation if the latter's global energy is more optimal than that
of the first representation; and

(c) terminating the optimization process according to step (b) when a
predetermined termination criterion is reached; and

(d) outputting to a storage medium or to a consecutive method a data structure
comprising information extracted from step (b).

27. A method according to any of claims 1 to 24, wherein the said conformational 
search method of step (C) includes a Monte Carlo simulation method. 

28. A method according to any of claims 1 to 24, wherein the said conformational 
search method of step (C) includes a genetic algorithm method. 

29. A method according to any of claims 1 to 24, wherein the said conformational
search method of step (C) includes a mean-field method.

30. A method according to any of claims 1 to 24, wherein the said conformational
search method of step (C) includes a molecular mechanics gradient method such as a
steepest descent or a conjugate gradient energy minimization method.

31. A method according to claim 30, wherein the said template residues are not kept
fixed but are allowed to vary by at most 2 angstroms in root-mean-square deviation
compared to the reference structure of step (A).

32. A method according to claim 30, wherein neither the said pattern residues nor the
said template residues are kept fixed but wherein they are allowed to vary by at most 2
angstroms in root-mean-square deviation compared to the substituted reference structure
of step (B).

33. A method according to any of claims 1 to 29, wherein the residue positions to which
the structure optimization is applied as defined in step (C) are rotameric, in that they are
allowed to assume conformations received from a rotamer library in the same way as
occurs for rotameric pattern residues.



34. A method according to any of claims 1 to 33, wherein the said energetic
compatibility of the pattern is calculated as the difference in global energy between the
substituted and optimized protein structure of step (C) and the reference structure of
step (A).

35. A method according to any of claims 1 to 33, wherein the said energetic
compatibility of the pattern is calculated as the absolute global energy of the substituted
and optimized protein structure of step (C), this being equivalent to selecting in step (A)
a reference amino acid sequence such as a polyglycine sequence or a dummy sequence.

36. A method according to any of claims 1 to 35, wherein the self-energy of the
template is ignored.

37. A method according to any of claims 1 to 36, wherein the self-energy of the protein
backbone is ignored.

38. A method according to any of claims 1 to 37, wherein the ECO formed in step (E)
further includes an unambiguous description of the pattern in terms of the number of
pattern residue positions, the location of each pattern residue position within the protein
structure, the selection of an amino acid residue type at each position and the
conformation of each pattern residue.
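
By way of illustration only, the information which an ECO carries according to claims 1 and 38 can be sketched as a small data structure. The class names, field names and the use of an integer rotamer index to describe a conformation are assumptions made for this sketch and are not prescribed by the claims.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class PatternResidue:
    position: int       # location of the pattern residue in the protein
    residue_type: str   # amino acid residue type, e.g. "LEU"
    rotamer_id: int     # hypothetical index describing the fixed conformation


@dataclass(frozen=True)
class ECO:
    pattern: Tuple[PatternResidue, ...]  # one or more pattern residues
    energy: float                        # energetic compatibility value
    reference_id: str                    # reference to the structure of step (A)

    @property
    def size(self) -> int:
        """Number of pattern residue positions (1 = single, 2 = double)."""
        return len(self.pattern)


# Hypothetical single residue ECO for the WT residue at position 5 of 1PGA.
eco = ECO(
    pattern=(PatternResidue(position=5, residue_type="LEU", rotamer_id=3),),
    energy=0.0,
    reference_id="1PGA",
)
```

Further optional fields, such as references to the rotamer library or the cost function used, could be added in the same way.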

39. A method according to any of claims 1 to 38, wherein the ECO further includes a 
description of or a reference to the protein structure received in step (A). 



40. A method according to any of claims 1 to 39, wherein the ECO further includes a 
description of or a reference to the modeled residue positions defined in step (A). 

41. A method according to any of claims 1 to 40, wherein the ECO further includes a
description of or a reference to the amino acid reference sequence defined in step (A).

42. A method according to any of claims 1 to 41, wherein the ECO further includes a 
description of or a reference to the structure optimization method used in step (C). 

43. A method according to any of claims 1 to 42, wherein the ECO further includes a
description of or a reference to the rotamer library used to assign the conformation of
the pattern residues in accordance with step (B) and/or to define the allowed
conformational states of the modeled, non-pattern residues in accordance with step (C).

44. A method according to any of claims 1 to 43, wherein the ECO further includes a
description of or a reference to the cost function used in step (C).

45. A method according to any of claims 1 to 44, wherein the ECO further includes a
description of or a reference to the substituted and optimized protein structure obtained
in step (C).

46. A method according to any of claims 1 to 45, wherein the ECO further includes
additional information derived from either the reference structure of step (A) or the
substituted and optimized structure of step (C), such as the local secondary structure
of each modeled residue and/or its accessible surface area.

47. A method according to any of claims 1 to 46, wherein steps (B) to (D) are repeated
for different patterns defined in step (B).



48. A method according to claim 47, wherein the patterns are single residue patterns
located at one single residue position selected from the set of modeled residue positions
of step (A) and are formed by systematically considering all possible combinations of
amino acid residue types in different conformations, and wherein the amino acid residue
types are selected from a pre-defined list of residue types and the said conformations are
selected from a pre-defined list of conformations.



49. A method according to claim 47, wherein the patterns are double residue patterns
located at two different single residue positions selected from the set of modeled residue
positions of step (A) and are formed by systematically considering, for the pair of
selected residue positions, all possible combinations of amino acid residue types in
different conformations and wherein the amino acid residue types are selected from a
pre-defined list of residue types and the said conformations are selected from a
pre-defined list of conformations.

50. A method according to claim 48, being systematically repeated until each single 
residue position of the set of modeled residue positions of step (A) has been selected as 
a pattern residue position. 

51. A method according to claim 49, being systematically repeated until each possible
pair of residue positions of the set of modeled residue positions of step (A) has been
selected as a pair of pattern residue positions.

52. A method according to claim 51, wherein the number of considered double residue
patterns is restricted in accordance with a method taking into account a measure of the
distance between two residue positions in the reference structure of step (A).

53. A method according to claim 52, wherein the number of considered double
residue patterns is restricted in accordance with a method further taking into account a
measure of the size of the amino acid residue types placed at each of both pattern
residue positions.

54. A method according to any of claims 1 to 53, wherein the quantitative measure 
representing the energetic compatibility of a pattern within the context of a substituted 
and adapted reference structure is further transformed. 

55. A method according to claim 54, wherein the said transformation is effected by
means of a linear, logarithmic or exponential function.

56. A method according to any of claims 47 to 55, wherein the ECO's are further
grouped into GECO's and wherein this grouping operation occurs on the basis of the
quantitative measure representing the energetic compatibility of different patterns being
located at the same residue position(s) and assuming the same residue type(s).
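
By way of illustration only, the grouping of ECO's into GECO's can be sketched as follows. The grouping rule itself (equations (5) and (10) of the description) is not reproduced in this section; the sketch assumes, purely as an example, that the GECO energy is the lowest ECO energy among all patterns sharing the same residue position(s) and residue type(s). The function name, the tuple layout and the energy values are hypothetical.

```python
def group_ecos(ecos):
    """Group ECO's by (positions, residue types), ignoring the conformation.

    ecos: iterable of (positions, residue_types, rotamer_ids, energy) tuples
    Assumed grouping rule: keep the lowest energy found in each group.
    """
    grouped = {}
    for positions, residue_types, _rotamer_ids, energy in ecos:
        key = (positions, residue_types)
        if key not in grouped or energy < grouped[key]:
            grouped[key] = energy
    return grouped


# Hypothetical single residue ECO's at position 5 (energies in kcal/mol).
ecos = [
    ((5,), ("LEU",), (1,), 2.4),
    ((5,), ("LEU",), (2,), 0.0),
    ((5,), ("ILE",), (1,), 3.1),
    ((5,), ("ILE",), (4,), 5.7),
]
gecos = group_ecos(ecos)
```

Other grouping rules, for instance a weighted average over conformations, fit the same interface by replacing the comparison in the loop.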

57. A fold recognition method to identify a potential structural relationship between a
particular target amino acid sequence and one or more protein 3D structures, the said
protein 3D structures being analyzed by a method according to any of claims 1 to 56.

58. An inverse folding method to identify a potential structural relationship between a
particular protein 3D structure and one or more known amino acid sequences, wherein
the said protein 3D structure is analyzed by a method according to any of claims 1 to 56.



59. A protein design method to identify or generate amino acid sequences which are
energetically compatible with a particular protein 3D structure, wherein the said protein
3D structure is analyzed by a method according to any of claims 1 to 56.

60. A method according to any of claims 1 to 56, wherein the cost function includes an
energetic contribution accounting for main-chain flexibility.



61. A method according to any of claims 1 to 56, wherein the energy function includes
an energetic contribution accounting for solvation effects.

62. A method according to claim 61, wherein the energetic contribution for solvation
effects is calculated by a method including the assignment of a set of energetic
solvation terms to each residue type depending on the degree of solvent exposure of
its respective rotamers at the considered residue positions in the protein structure.

63. A method according to claim 61, wherein the energetic contribution for solvation
effects is a type-dependent, topology-specific solvation method including
establishing a set of energetic parameters for each of different classes of solvent
exposure.

64. A method according to claim 63, wherein the classes of solvent exposure correspond 
with buried, semi-buried and solvent-exposed rotamers, respectively. 

65. A method according to claim 63 or claim 64, wherein: 

- first, each residue type at a given residue position in a specific rotameric state is 
substituted into the protein structure and its accessible surface area (ASA) is 
calculated, 

- next, a class assignment occurs on the basis of the percentage ASA of the residue 
side chain in the protein structure compared to the maximal ASA of the same side 
chain being shielded from the solvent only by its own main-chain atoms, and 

- finally an appropriate type- and topology-specific energy, E_TTS(T,C), for the 
considered pattern element is retrieved from a table of TTS values, using the pattern 
type (T) and class (C) indices. 
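
The three steps of this claim (ASA calculation, class assignment, table lookup) can be illustrated with a minimal Python sketch. It assumes the two ASA values have already been computed by some external routine. The class boundaries (20% and 50%), the table entries and all names (`ExposureClass`, `exposure_class`, `tts_energy`, `TTS_TABLE`) are hypothetical placeholders for illustration; the claims do not specify them.

```python
from enum import Enum


class ExposureClass(Enum):
    """The three solvent exposure classes named in claim 64."""
    BURIED = "buried"
    SEMI_BURIED = "semi-buried"
    EXPOSED = "solvent-exposed"


# Hypothetical class boundaries, in percent of the maximal side-chain ASA.
# The claims do not give numerical thresholds; these are placeholders.
BURIED_MAX_PCT = 20.0
SEMI_BURIED_MAX_PCT = 50.0

# Hypothetical TTS table indexed by (residue type T, exposure class C).
# The energies are invented for illustration only.
TTS_TABLE = {
    ("LEU", ExposureClass.BURIED): -1.2,
    ("LEU", ExposureClass.SEMI_BURIED): -0.4,
    ("LEU", ExposureClass.EXPOSED): 0.8,
    ("ASP", ExposureClass.BURIED): 1.5,
    ("ASP", ExposureClass.SEMI_BURIED): 0.3,
    ("ASP", ExposureClass.EXPOSED): -0.6,
}


def exposure_class(asa_in_protein: float, max_asa: float) -> ExposureClass:
    """Assign a class from the side-chain ASA in the protein relative to the
    maximal ASA (side chain shielded only by its own main-chain atoms)."""
    if max_asa <= 0.0:
        raise ValueError("max_asa must be positive")
    pct = 100.0 * asa_in_protein / max_asa
    if pct < BURIED_MAX_PCT:
        return ExposureClass.BURIED
    if pct < SEMI_BURIED_MAX_PCT:
        return ExposureClass.SEMI_BURIED
    return ExposureClass.EXPOSED


def tts_energy(residue_type: str, asa_in_protein: float, max_asa: float,
               table=TTS_TABLE) -> float:
    """Retrieve E_TTS(T, C) for one rotamer from the (T, C)-indexed table."""
    return table[(residue_type, exposure_class(asa_in_protein, max_asa))]
```

For example, `tts_energy("LEU", 10.0, 100.0)` classifies the rotamer as buried (10% exposure under the placeholder thresholds) and returns the corresponding placeholder entry of the table.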

66. A method according to claim 65, wherein the table of TTS values is established by: 

- first considering a set of a number of well-resolved protein structures and assigning 
to each residue position of this set a unique index i, 

- then considering, at each residue position i, all residue types a in all rotameric states 
r, thereby forming the set of all possible position-type-rotamer combinations i_r^a, 

- defining by i_r^w the WT residue type at position i in a rotameric state r as observed in 
the protein comprising position i, 

- assigning to each i_r^a a solvent exposure class C(i_r^a), 

- calculating for each i_r^a the value E_pot(i_r^a) as the total potential energy the rotamer i_r^a 
experiences within the fixed context of the protein structure comprising i, 

- defining a function ΔE(i) according to equation (2) 

$$\Delta E(i) = \min_r \left( E_{pot}(i_r^w) + E_{TTS}(w, C(i_r^w)) \right) - \min_a \min_r \left( E_{pot}(i_r^a) + E_{TTS}(a, C(i_r^a)) \right) \qquad \text{(eq. 2)}$$

ΔE(i) being the difference in energy of the WT residue type in its best possible 
rotamer and the energetically most favorable residue type at residue position i 
also in its best possible rotamer, given a set of TTS parameters, and 

- applying a suitable parameter optimization method to maximize S by varying 
E_TTS(T,C) parameters. 
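
As an illustration of equation (2), the following Python sketch evaluates ΔE(i) for a single residue position. It assumes the potential energies and exposure classes of all type-rotamer combinations at that position are already available. The function name `delta_e`, the dictionary layout and the numerical values are hypothetical. The quantity S of the final optimization step is not defined in this claim and is therefore not sketched.

```python
def delta_e(e_pot, exposure, e_tts, wt_type):
    """Evaluate eq. 2 at one residue position i.

    e_pot:    {(type a, rotamer r): E_pot(i_r^a)}
    exposure: {(type a, rotamer r): exposure class C(i_r^a)}
    e_tts:    {(type T, class C): E_TTS(T, C)}
    wt_type:  the wild-type residue type w at this position
    """
    def total(key):
        residue_type, _rotamer = key
        return e_pot[key] + e_tts[(residue_type, exposure[key])]

    # First term: best rotamer of the WT residue type.
    best_wt = min(total(k) for k in e_pot if k[0] == wt_type)
    # Second term: best rotamer over all residue types (WT included).
    best_any = min(total(k) for k in e_pot)
    return best_wt - best_any


# Invented example: WT is LEU, with ALA as the only alternative type.
e_pot = {("LEU", 0): -3.0, ("LEU", 1): -1.0, ("ALA", 0): -2.5}
exposure = {("LEU", 0): "buried", ("LEU", 1): "exposed", ("ALA", 0): "buried"}
e_tts = {("LEU", "buried"): -1.0, ("LEU", "exposed"): 0.5,
         ("ALA", "buried"): -2.0}

print(delta_e(e_pot, exposure, e_tts, wt_type="LEU"))  # 0.5
```

When the minimum over residue types includes the WT type, ΔE(i) cannot be negative, and it equals zero exactly when the WT type is the energetically most favorable type at position i.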

67. A type-dependent, topology-specific solvation method for the assignment of a set of 
energetic solvation terms to a set of residue types, depending on the degree of solvent 
exposure of their respective rotamers at the considered residue positions in a protein 
structure, wherein: 

- first, each residue type at a given residue position in a specific rotameric state is 
substituted into the protein structure and its accessible surface area (ASA) is 
calculated, 

- next, a class assignment occurs on the basis of the percentage ASA of the residue 
side chain in the protein structure compared to the maximal ASA of the same side 
chain being shielded from the solvent only by its own main-chain atoms, and 

- finally an appropriate type- and topology-specific energy, E_TTS(T,C), for the 
considered pattern element is retrieved from a table of TTS values, using the pattern 
type (T) and class (C) indices. 


68. A method according to claim 67, wherein the table of TTS values is established by: 

- first considering a set of a number of well-resolved protein structures and assigning 
to each residue position of this set a unique index i, 

- then considering, at each residue position i, all residue types a in all rotameric states 
r, thereby forming the set of all possible position-type-rotamer combinations i_r^a, 

- defining by i_r^w the WT residue type at position i in a rotameric state r as observed in 
the protein comprising position i, 

- assigning to each i_r^a a solvent exposure class C(i_r^a), 

- calculating for each i_r^a the value E_pot(i_r^a) as the total potential energy the rotamer i_r^a 
experiences within the fixed context of the protein structure comprising i, 

- defining a function ΔE(i) according to equation (2) 

$$\Delta E(i) = \min_r \left( E_{pot}(i_r^w) + E_{TTS}(w, C(i_r^w)) \right) - \min_a \min_r \left( E_{pot}(i_r^a) + E_{TTS}(a, C(i_r^a)) \right) \qquad \text{(eq. 2)}$$

ΔE(i) being the difference in energy of the WT residue type in its best possible 
rotamer and the energetically most favorable residue type at residue position i 
also in its best possible rotamer, given a set of TTS parameters, and 

- applying a suitable parameter optimization method to maximize S by varying 
E_TTS(T,C) parameters. 

69. A nucleic acid sequence encoding a protein sequence analyzed by a method 
according to any of claims 1 to 56. 

70. An expression vector comprising the nucleic acid sequence of claim 69. 

71. A host cell comprising the nucleic acid sequence of claim 69. 

72. A pharmaceutical composition comprising a therapeutically effective amount of a 
protein sequence analyzed by a method according to any of claims 1 to 56 and a 
pharmaceutically acceptable carrier. 

73. A method of treating a disease in a mammal, comprising administering to said 
mammal a pharmaceutical composition according to claim 72. 

74. An ECO obtainable by a method according to any of claims 1 to 56. 

75. A database in the form of a data structure comprising a set of ECO's obtainable by 
a method according to any of claims 1 to 56. 

76. A computing device for analyzing a protein structure, comprising: 

means for receiving a reference structure for a protein, whereby 

- the said reference structure forms a representation of a 3D structure of the said 
protein, and 

- the protein consists of a plurality of residue positions, each carrying a particular 
reference amino acid type in a specific reference conformation, and 

- the protein residues are classified into a set of modeled residue positions and a set 
of fixed residues, the latter being included into a fixed template; 

means for substituting into the reference structure of step (A) a pattern, whereby 

- the said pattern consists of one or more of the modeled residue positions defined 
in step (A), each carrying a particular amino acid residue type in a fixed 
conformation, and 

- the one or more amino acid residue types of the said pattern are replacing the 
corresponding amino acid residue types present in the reference structure; 

means for optimizing the global conformation of the reference structure of step (A) 
being substituted by the pattern of step (B), whereby 

- a suitable protein structure optimization method based on a function allowing to 
assess the quality of a global protein structure, or any part thereof, is used in 
combination with a suitable conformational search method, and 

- the said structure optimization method is applied to all modeled residue positions 
defined in step (A) not being located at any of the pattern residue positions 
defined in step (B), and 

- the pattern and template residues are kept fixed; 

means for assessing the energetic compatibility of the pattern defined in step (B) within 
the context of the reference structure defined in step (A) being structurally optimized in 
step (C) with respect to the said pattern, by way of comparing the global energy of the 
substituted and optimized protein structure with the global energy of the non-substituted 
reference structure; and 

memory for storing a value reflecting the said energetic compatibility of the pattern 
together with information related to the structure of the pattern in the form of an 
energetic compatibility object as a data structure. 
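
The energetic compatibility object stored by the memory of this claim can be pictured as a small record. The Python sketch below is one possible layout, not the data structure of the invention: the field names, the classes `PatternElement` and `ECO`, and the helper `make_eco` are hypothetical, and expressing the comparison of the two global energies as a simple difference is an assumption.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass(frozen=True)
class PatternElement:
    """One residue of a pattern (hypothetical field names)."""
    position: int       # index of the modeled residue position
    residue_type: str   # amino acid type substituted at that position
    rotamer: int        # index of the fixed conformation


@dataclass(frozen=True)
class ECO:
    """Energetic compatibility object: a pattern and its energy value."""
    pattern: Tuple[PatternElement, ...]
    energy: float       # assumed: E(substituted, optimized) - E(reference)


def make_eco(pattern: Iterable[PatternElement],
             e_substituted_optimized: float,
             e_reference: float) -> ECO:
    """Build an ECO from the two global energies compared in the claim."""
    return ECO(tuple(pattern), e_substituted_optimized - e_reference)


# Invented example: a single-residue pattern that lowers the global energy.
eco = make_eco([PatternElement(position=12, residue_type="LEU", rotamer=3)],
               e_substituted_optimized=-105.0, e_reference=-100.0)
print(eco.energy)  # -5.0
```

Making the records immutable (`frozen=True`) lets them serve as dictionary keys or set members, which may be convenient when a set of such objects is collected into a database as in claim 75.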






77. A computing device for computing a type-dependent, topology-specific solvation 
for the assignment of a set of energetic solvation terms to a set of residue types, 
depending on the degree of solvent exposure of their respective rotamers at the 
considered residue positions in a protein structure, comprising: 

- Means for substituting each residue type at a given residue position in a specific 
rotameric state into the protein structure and calculating its accessible surface area 
(ASA), 

- Means for assigning a class on the basis of the percentage ASA of the residue side 
chain in the protein structure compared to the maximal ASA of the same side chain 
being shielded from the solvent only by its own main-chain atoms, and 

- Means for retrieving an appropriate type- and topology-specific energy, E_TTS(T,C), 
for the considered pattern element from a table of TTS values in the form of a data 
structure, using the pattern type (T) and class (C) indices. 

78. A computer program product to be utilized for computing on a computing system 
with a processor and memory, comprising: 
instruction means for receiving a reference structure for a protein, whereby 

- the said reference structure forms a representation of a 3D structure of the said 
protein, and 

- the protein consists of a plurality of residue positions, each carrying a particular 
reference amino acid type in a specific reference conformation, and 

- the protein residues are classified into a set of modeled residue positions and a set 
of fixed residues, the latter being included into a fixed template; 

instruction means for substituting into the reference structure of step (A) a pattern, 
whereby 

- the said pattern consists of one or more of the modeled residue positions defined 
in step (A), each carrying a particular amino acid residue type in a fixed 
conformation, and 

- the one or more amino acid residue types of the said pattern are replacing the 
corresponding amino acid residue types present in the reference structure; 

instruction means for optimizing the global conformation of the reference structure 
of step (A) being substituted by the pattern of step (B), whereby 

- a suitable protein structure optimization method based on a function allowing to 
assess the quality of a global protein structure, or any part thereof, is used in 
combination with a suitable conformational search method, and 

- the said structure optimization method is applied to all modeled residue 
positions defined in step (A) not being located at any of the pattern residue 
positions defined in step (B), and 

- the pattern and template residues are kept fixed; 

instruction means for assessing the energetic compatibility of the pattern defined in step 
(B) within the context of the reference structure defined in step (A) being structurally 
optimized in step (C) with respect to the said pattern, by way of comparing the global 
energy of the substituted and optimized protein structure with the global energy of the 
non-substituted reference structure; and 

instruction means for storing a value reflecting the said energetic compatibility of the 
pattern together with information related to the structure of the pattern in the form of an 
energetic compatibility object as a data structure in memory. 

79. A computer program product to be utilized with a computer system having a 
memory and a processor for computing a type-dependent, topology-specific solvation 
for the assignment of a set of energetic solvation terms to a set of residue types, 
depending on the degree of solvent exposure of their respective rotamers at the 
considered residue positions in a protein structure, comprising: 

- Instruction means for substituting each residue type at a given residue position in a 
specific rotameric state into the protein structure and calculating its accessible surface 
area (ASA), 

- Instruction means for assigning a class on the basis of the percentage ASA of the 
residue side chain in the protein structure compared to the maximal ASA of the 
same side chain being shielded from the solvent only by its own main-chain atoms, 
and 

- Instruction means for retrieving an appropriate type- and topology-specific energy, 
E_TTS(T,C), for the considered pattern element from a table of TTS values in the form 
of a data structure, using the pattern type (T) and class (C) indices. 

80. A computer readable data carrier comprising an executable computer program 
product in accordance with claim 78 or 79 or for executing any of the methods of 
claims 1 to 66. 

81. A method comprising the steps of: 

inputting a description of at least a protein reference structure at a near location; 

transmitting the description to a remote processing engine running a computer 
program for carrying out any of the methods of claims 1 to 66; 

receiving at a near location from the remote processing engine an output of a 
method in accordance with any of the claims 1 to 66. 

FIGURE 2 